
🐞 dagger v0.11.2 possible memory leak #7258

Closed
kjuulh opened this issue May 2, 2024 · 7 comments
Labels
kind/bug Something isn't working

Comments

@kjuulh
Contributor

kjuulh commented May 2, 2024

What is the issue?

We've seen that, since bumping to 0.11+, some of our containers are hitting out-of-memory issues. Currently our dagger engines sit at around 7GB of memory, whereas before 0.11 they were at around 2GB. Memory keeps climbing until Kubernetes chooses to preempt the pods.

We initially noticed it because quite a few bare-bones debian:12.5-slim execs running apt-get update followed by apt-get install -y ca-certificates exited with code 137.

From what I could gather, that happens when buildkit sends a SIGKILL to the process because it has run out of memory (137 = 128 + 9, i.e. SIGKILL). As 0.10 -> 0.11 was quite a big change, it is possibly quite difficult to find the regression.

apt-get is also cut off at different points while it runs: it isn't always at the start, it can be during "Reading package lists...", or when it is about to finish. It isn't entirely consistent. I am also quite unsure why buildkit only kills this exact process.

We get a few of these every hour:

        event_timestamp (utc)
-------------------------------
 2024-05-02 14:12:29
 2024-05-02 14:11:43
 2024-05-02 14:02:04
 2024-05-02 14:00:00
 2024-05-02 13:56:47
 2024-05-02 13:34:14
 2024-05-02 13:34:14
 2024-05-02 13:21:07
 2024-05-02 13:19:08

Dagger version

dagger v0.11.2

Steps to reproduce

We simply let our CI system run for a few hours (probably around 100-200 builds of our golang build pipeline).
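
For concreteness, here is a minimal sketch of the kind of step that gets killed, using the Go SDK. The pipeline name and the file read at the end are placeholders rather than our actual code; the call chain just mirrors the container.from.withExec.withExec.file path in the panic below.

package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

func main() {
	ctx := context.Background()

	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Placeholder pipeline name; the From/WithExec chain is the part that gets OOM-killed.
	ctr := client.
		Pipeline("example").
		Container().
		From("debian:12.5-slim").
		WithExec([]string{"apt-get", "update"}).
		WithExec([]string{"apt-get", "install", "-y", "ca-certificates"})

	// Placeholder file read; this is where the exit code 137 error surfaces.
	contents, err := ctr.File("/etc/ssl/certs/ca-certificates.crt").Contents(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println("read", len(contents), "bytes")
}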

Log output

panic: input: pipeline.container.from.withExec.withExec.file resolve: process "apt-get install -y ca-certificates" did not complete successfully: exit code: 137
@kjuulh kjuulh added the kind/bug Something isn't working label May 2, 2024
@Scalahansolo

This actually sounds like something really similar to what we've been experiencing with our Dagger setup in K8s. Our team has been really vocal with the Dagger people and I believe they are on this.

It's a bummer this is more widespread, but it gives me some solace that it's not just my K8s setup seeing this issue.

@sipsma
Contributor

sipsma commented May 4, 2024

Gonna take a look at this to see if we can repro and squeeze in a fix for next release

@sipsma sipsma added this to the v0.11.3 milestone May 4, 2024
@sipsma sipsma self-assigned this May 4, 2024
@sipsma
Contributor

sipsma commented May 4, 2024

Can repro. Found one cause which is a goroutine leak upstream, fix here: moby/buildkit#4902 cc @jedevc

Re-running (reproing by executing the full module test suite) with that fix in place does fix the goroutine leak and helps memory usage somewhat, but the RSS is still flat at ~1GB after everything has disconnected, which is still too high.

  • Interestingly, getting a heap profile w/ pprof shows only 298MB of memory used? But the 1GB RSS doesn't go down even when manually triggering Go's GC (see the sketch after this list).
  • I'm gonna check if we are leaking memory outside of the go runtime (e.g. I think if we are leaving around a Linux pipe with data that would count towards our RSS? And similar sort of things)
  • I also have 32GB of RAM, so I need to double-check whether the Go runtime will leave memory allocated from the OS's perspective, even after objects have been released, when the system has lots of RAM available.
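
For reference, a generic Go sketch of how a heap or goroutine profile can be scraped and the GC forced; this assumes the standard net/http/pprof handlers are exposed on a debug listener (the address is a placeholder, not an actual engine flag), and is not the engine's real debug setup.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
	"runtime"
	"runtime/debug"
)

func main() {
	// Hypothetical debug-only listener for pprof.
	go func() {
		log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
	}()

	// "Manually triggering Go's GC":
	runtime.GC()         // run a full collection
	debug.FreeOSMemory() // GC plus an attempt to return freed memory to the OS

	select {} // keep the process alive so profiles can be scraped
}

With that in place, go tool pprof http://127.0.0.1:6060/debug/pprof/heap produces the kind of top50 listing shown further down, /debug/pprof/goroutine?debug=1 shows the live goroutine count (handy for confirming the goroutine-leak fix), and comparing two saved heap profiles with go tool pprof -diff_base=old.pprof new.pprof highlights what grows between runs.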

@sipsma
Contributor

sipsma commented May 4, 2024

It does seem that at least part of the remaining RSS is just due to Linux not returning memory pages until under memory pressure: golang/go#39779 (comment)

  • When I manually allocate a bunch of memory, the RSS of the engine goes down to like 700ish MB (a small demo of this effect is sketched below)
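
A small standalone demo of that effect (a rough sketch, Linux-only since it reads /proc/self/status; not engine code): allocate and drop ~1GB, run the GC, and RSS typically only falls once the runtime is explicitly asked to hand pages back to the OS (or the kernel reclaims them under pressure).

package main

import (
	"fmt"
	"os"
	"regexp"
	"runtime"
	"runtime/debug"
)

// rss returns the VmRSS line from /proc/self/status (Linux only).
func rss() string {
	b, err := os.ReadFile("/proc/self/status")
	if err != nil {
		return "unknown"
	}
	if m := regexp.MustCompile(`VmRSS:\s+\d+ kB`).Find(b); m != nil {
		return string(m)
	}
	return "unknown"
}

func main() {
	buf := make([]byte, 1<<30)
	for i := range buf {
		buf[i] = 1 // touch every page so it actually counts towards RSS
	}
	fmt.Println("after alloc:        ", rss())

	buf = nil
	runtime.GC()
	fmt.Println("after GC:           ", rss()) // often still high: freed pages are returned lazily

	debug.FreeOSMemory()
	fmt.Println("after FreeOSMemory: ", rss()) // should drop once pages are handed back
}

Running with GODEBUG=madvdontneed=1 (mostly relevant for older Go runtimes that default to MADV_FREE) also makes RSS reflect frees immediately instead of waiting for memory pressure, which helps when judging whether engine RSS is real growth or just unreturned pages.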

That being said, looking at the heap usage after re-running tests repeatedly does show some usage creeping up (though pretty slowly). One heap snapshot:

(pprof) top50
Showing nodes accounting for 200.41MB, 83.92% of 238.81MB total
Dropped 350 nodes (cum <= 1.19MB)
Showing top 50 nodes out of 280
      flat  flat%   sum%        cum   cum%
   36.71MB 15.37% 15.37%    36.71MB 15.37%  github.com/moby/buildkit/solver/pb.(*FileActionMkFile).Unmarshal
   28.05MB 11.74% 27.12%    29.05MB 12.16%  encoding/json.(*decodeState).literalStore
   21.65MB  9.07% 36.18%    21.65MB  9.07%  google.golang.org/protobuf/internal/impl.consumeStringValidateUTF8
   11.83MB  4.95% 41.14%    22.83MB  9.56%  github.com/dagger/dagger/core.(*ModDeps).lazilyLoadSchema
      10MB  4.19% 45.33%       10MB  4.19%  runtime.malg
    6.31MB  2.64% 47.97%    12.95MB  5.42%  github.com/dagger/dagger/core.(*ModDeps).SchemaIntrospectionJSON
    6.28MB  2.63% 50.60%     6.28MB  2.63%  encoding/json.Marshal
       6MB  2.51% 53.11%        7MB  2.93%  github.com/moby/buildkit/cache/metadata.NewValue
    5.50MB  2.30% 55.41%    12.27MB  5.14%  go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.newConfig
       5MB  2.09% 57.51%        5MB  2.09%  github.com/moby/buildkit/client/llb.mergeMetadata
    4.50MB  1.89% 59.39%     4.50MB  1.89%  github.com/moby/buildkit/cache.(*cacheRecord).ref
    4.50MB  1.88% 61.28%     4.50MB  1.88%  strings.(*Builder).grow
    3.50MB  1.47% 62.74%     3.50MB  1.47%  github.com/dagger/dagger/dagql.Class[go.shape.*uint8].Install
    3.27MB  1.37% 64.11%     3.27MB  1.37%  go.opentelemetry.io/otel/internal/global.(*meter).Int64Histogram
    3.07MB  1.28% 65.40%     3.07MB  1.28%  go.etcd.io/bbolt.(*freelist).arrayAllocate
       3MB  1.26% 66.65%        5MB  2.09%  github.com/moby/buildkit/solver.(*cacheManager).newKeyWithID (inline)
    2.87MB  1.20% 67.86%     2.87MB  1.20%  github.com/moby/buildkit/solver/pb.(*Op).Marshal
    2.06MB  0.86% 68.72%     2.06MB  0.86%  google.golang.org/grpc/internal/transport.newBufWriter
       2MB  0.84% 69.56%        3MB  1.26%  github.com/moby/buildkit/cache/metadata.newStorageItem.func1
       2MB  0.84% 70.40%        2MB  0.84%  github.com/moby/buildkit/cache/metadata.(*StorageItem).setValue
       2MB  0.84% 71.23%        2MB  0.84%  fmt.Sprintf
       2MB  0.84% 72.07%        2MB  0.84%  github.com/moby/buildkit/solver.newKey (inline)
       2MB  0.84% 72.91%        2MB  0.84%  reflect.New
       2MB  0.84% 73.75%        2MB  0.84%  go.opentelemetry.io/otel/metric.WithDescription (inline)
    1.55MB  0.65% 74.40%     8.55MB  3.58%  github.com/moby/buildkit/cache.(*mutableRef).commit
    1.50MB  0.63% 75.02%     1.50MB  0.63%  go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).ensureAttributesCapacity
    1.50MB  0.63% 75.65%     1.50MB  0.63%  github.com/moby/buildkit/cache.(*cacheRecord).layerDigestChain
    1.50MB  0.63% 76.28%        2MB  0.84%  github.com/dagger/dagger/dagql.inputSpecsForType
    1.50MB  0.63% 76.91%        2MB  0.84%  github.com/moby/buildkit/util/flightcontrol.(*Group[go.shape.int64]).do
    1.50MB  0.63% 77.54%     1.50MB  0.63%  github.com/moby/buildkit/solver/pb.(*SourceOp).Unmarshal
    1.50MB  0.63% 78.17%     1.50MB  0.63%  github.com/moby/buildkit/solver/bboltcachestorage.(*Store).WalkLinks.func1
    1.50MB  0.63% 78.79%     1.50MB  0.63%  go.opentelemetry.io/otel/metric.WithUnit (inline)
    1.23MB  0.52% 79.31%     1.23MB  0.52%  go.etcd.io/bbolt.(*freelist).free
       1MB  0.42% 79.73%     7.50MB  3.14%  github.com/moby/buildkit/solver.(*edge).probeCache
       1MB  0.42% 80.15%     1.50MB  0.63%  github.com/dagger/dagger/dagql.(*Server).installObjectLocked
       1MB  0.42% 80.57%     8.87MB  3.71%  github.com/moby/buildkit/client/llb.marshal
       1MB  0.42% 80.99%        2MB  0.84%  github.com/moby/buildkit/solver/llbsolver.newVertex
       1MB  0.42% 81.41%    42.22MB 17.68%  github.com/moby/buildkit/solver/pb.(*Op).Unmarshal
    0.51MB  0.21% 81.62%     8.78MB  3.68%  github.com/moby/buildkit/session.(*Manager).handleConn
    0.50MB  0.21% 81.83%     1.50MB  0.63%  github.com/dagger/dagger/dagql.Class[go.shape.*uint8].TypeDefinition
    0.50MB  0.21% 82.04%        5MB  2.09%  github.com/dagger/dagger/core.NewRoot
    0.50MB  0.21% 82.25%    10.37MB  4.34%  github.com/moby/buildkit/client/llb.State.Marshal
    0.50MB  0.21% 82.46%     2.50MB  1.05%  encoding/json.Unmarshal
    0.50MB  0.21% 82.66%        2MB  0.84%  github.com/moby/buildkit/solver/pb.(*ExecOp).Unmarshal
    0.50MB  0.21% 82.87%     3.50MB  1.47%  github.com/moby/buildkit/cache/metadata.newStorageItem
    0.50MB  0.21% 83.08%        4MB  1.68%  github.com/moby/buildkit/cache.(*cacheManager).getRecord
    0.50MB  0.21% 83.29%     5.50MB  2.30%  github.com/moby/buildkit/cache.(*cacheMetadata).queueValue
    0.50MB  0.21% 83.50%     1.50MB  0.63%  github.com/moby/buildkit/solver.(*Solver).loadUnlocked
    0.50MB  0.21% 83.71%     2.50MB  1.05%  github.com/moby/buildkit/solver.(*edge).createInputRequests
    0.50MB  0.21% 83.92%    29.55MB 12.37%  encoding/json.(*decodeState).object

I think some of it may be from Go pools (e.g. the json literal store), which is fine since that should still be freeable when needed (afaik, worth confirming).

But I am a bit confused why FileActionMkFile.Unmarshal and lazilyLoadSchema seem to creep up a few MB every time I run the tests (followed by a manual trigger of the GC each time).

Overall, I suspect that the worst of the leak is fixed by the upstream PR mentioned in the previous comment, but there does still seem to be some stuff to look into.

@jedevc
Member

jedevc commented May 7, 2024

The worst parts of this look like they should ideally be solved by #7295. I'm going to take this out of the milestone (since we've now merged that PR), but I'll leave this open until the upstream fix has been merged.

@jedevc jedevc removed this from the v0.11.3 milestone May 7, 2024
@kjuulh
Contributor Author

kjuulh commented May 13, 2024

Awesome @sipsma

@jedevc
Member

jedevc commented May 13, 2024

Aha, thanks for bumping this @kjuulh - the upstream fix for this is merged now, so closing this 🎉

If you find any more performance regressions like this, we can re-open / open a new issue, whatever works! Thanks!

@jedevc jedevc closed this as completed May 13, 2024