Apparent memory leak when using Distributed? #2102

@jarbus

Description

Package Version

Status ~/.julia/environments/v1.8/Project.toml
  [587475ba] Flux v0.13.8 https://github.com/FluxML/Flux.jl.git#master

Julia Version

julia version 1.8.2

OS / Environment

OS: Arch Linux x86_64
Kernel: 6.0.6-arch1-1

Describe the bug

Flux appears to use an increasing amount of memory when its reconstruction function (re, from Flux.destructure) and the Adam optimizer are used together with Distributed.jl.

Steps to Reproduce

Warning: this will quickly eat up your RAM.
I've identified four conditions that must be met for this leak to occur:

  1. Must be using Distributed
  2. Must be using an optimizer
  3. Must be running update! on a worker
  4. Must call the reconstruction function re(theta) on a worker

using Distributed
addprocs(1)
@everywhere begin
  using Flux
  opt = Adam()
  theta, re = Chain(Dense(1000 => 1000, tanh)) |> Flux.destructure
end

println("Beginning loop")

for i in 1:1000
  println(i)

  # Note: bug doesn't occur if this remotecall targets process 1 instead of worker 2
  fetch(remotecall(2) do
    # Note: bug doesn't occur if you remove below line
    re(theta)
    1
  end)

  @everywhere begin
    grad = theta * 0.01
    # Note: bug doesn't occur if you switch out update! with this line
    #theta .+= grad  
    Flux.Optimise.update!(opt, theta, grad)
  end
end
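
For comparison, here is a sketch of the non-leaking control case described in the comments above (this variant is mine, not part of the original reproducer): it keeps the same setup and the re(theta) call on worker 2, but replaces Flux.Optimise.update! with the plain broadcasted update, which does not trigger the growth.

# Control variant (sketch): same setup as above, only the loop body differs.
for i in 1:1000
  fetch(remotecall(2) do
    re(theta)        # reconstruction still runs on worker 2
    1
  end)

  @everywhere begin
    grad = theta * 0.01
    theta .+= grad   # plain in-place update instead of update!: no leak observed
  end
end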

Expected Results

This code should run to completion using a roughly constant amount of memory, since nothing is retained between iterations of the loop.

Observed Results

Memory usage increases rapidly and without bound as the loop runs.
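
A simple way to quantify the growth (a diagnostic sketch, not part of the original report) is to record the peak resident set size of both the main process and the worker on each iteration; Sys.maxrss() reports this in bytes, and remotecall_fetch can read it from worker 2:

# Hypothetical instrumentation added to the reproducer loop above.
worker_rss_mib() = remotecall_fetch(Sys.maxrss, 2) ÷ 2^20
for i in 1:1000
  fetch(remotecall(2) do
    re(theta)
    1
  end)
  @everywhere begin
    grad = theta * 0.01
    Flux.Optimise.update!(opt, theta, grad)
  end
  println(i, ": main ", Sys.maxrss() ÷ 2^20, " MiB, worker ", worker_rss_mib(), " MiB")
end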

Relevant log output

No response
