Sudden memory leak when training on GPU over many epochs #736

aterenin opened this Issue Apr 15, 2019 · 0 comments

A self-contained MWE is below. On my system, it runs for about 40 epochs with the following output from nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     50425      C   julia-1.1.0/bin/julia                       1713MiB |
+-----------------------------------------------------------------------------+

Then it suddenly gives the following output:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     44870      C   julia-1.1.0/bin/julia                      11997MiB |
+-----------------------------------------------------------------------------+

I stress that the shift in memory usage is sudden: the process sits at roughly 2 GB for many epochs, then abruptly fills essentially all available device memory. Note that the PIDs above differ because I killed the first process and restarted training.

This behavior occurs with other networks as well. It prevents me from training large-scale models overnight, and it also keeps other people from using the same machine for their own jobs. I do not know the Flux/Tracker/CuArrays internals well enough to speculate on the cause, but I would love to help debug.
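
For what it's worth, here is roughly how I have been pinning down the epoch at which the jump happens from inside Julia, rather than watching nvidia-smi by hand; the MWE itself starts right after this snippet. This is only a sketch: it assumes CUDAdrv is available alongside CuArrays and that CUDAdrv.Mem.info() returns free and total device memory in bytes, so if that API differs in the installed version, adjust accordingly.

using CUDAdrv   # assumed to be available as a dependency of CuArrays

# Report how much device memory the driver says is in use, so the exact
# epoch of the ~2 GB -> ~12 GB jump can be identified.
function log_gpu_memory(epoch::Integer)
  free, total = CUDAdrv.Mem.info()   # bytes
  used_gib = (total - free) / 2^30
  println("epoch $epoch: $(round(used_gib, digits = 2)) GiB of device memory in use")
end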

using CuArrays
using Flux
using Flux: train!, onehotbatch, @epochs
using MLDatasets

# Build shuffled minibatches: split (x, y) into batches of size batch_size along the last (observation) dimension.
function make_batches(data::T, batch_size::Integer)::Array{T} where T <: Tuple{AbstractArray,AbstractArray}
  batches = Vector{T}()
  idx_shuffled = Flux.Random.randperm(size(data[1])[end])
  for idx in Iterators.partition(idx_shuffled, batch_size)
    x = selectdim(data[1], length(size(data[1])), idx)
    y = selectdim(data[2], length(size(data[2])), idx)
    push!(batches, (x,y))
  end
  batches
end

# Load MNIST as Float32 WHCN arrays, split into minibatches of 16, one-hot encode the labels, and move every batch to the GPU up front.
data = MNIST.traindata() |>
  x -> (reshape(Array{Float32}(x[1]),(28,28,1,:)),x[2]) |>
  x -> make_batches(x,16) |>
  x -> map(y->(y[1], onehotbatch(y[2],0:9)),x) |>
  x -> map(y->gpu.(y),x)

# Error on scalar indexing of GPU arrays rather than silently falling back to slow element-wise copies.
CuArrays.allowscalar(false)

# Convolutional classifier for the 28x28x1 MNIST images prepared above.
network = Chain(
    Conv((3,3),1=>16,relu;pad=(1,1)),
    BatchNorm(16),
    Conv((3,3),16=>16;pad=(1,1)),
    MeanPool((3,3); pad=(1,1)),
    Conv((3,3),16=>32,relu;pad=(1,1)),
    BatchNorm(32),
    Conv((3,3),32=>32;pad=(1,1)),
    MeanPool((3,3); pad=(1,1)),
    Conv((3,3),32=>64,relu;pad=(1,1)),
    BatchNorm(64),
    Conv((3,3),64=>64;pad=(1,1)),
    MeanPool((3,3); pad=(1,1)),
    Conv((3,3),64=>128,relu;pad=(1,1)),
    BatchNorm(128),
    Conv((3,3),128=>128;pad=(1,1)),
    MeanPool((4,4); pad=(1,1)),
    x -> reshape(x,(:,size(x)[end])),
    Dense(128,10),
    softmax
  ) |> gpu

# Cross-entropy loss plus an L2 penalty over all network parameters.
function loss(x::AbstractArray, y::AbstractArray)::Real
  y_predicted = network(x)
  R = 0.01f0 * sum(sum(w.^2) for w in params(network))
  Flux.crossentropy(y_predicted, y) + R
end

optimizer = ADAM()

@epochs 1000 train!(loss, params(network), data, optimizer)
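
One experiment I can try is replacing @epochs with an explicit loop that forces a host-side GC after every epoch, in case the sudden growth comes from unreachable tracked arrays that have not yet been finalized. A sketch is below; I am not sure whether CuArrays.reclaim() is the right name in the installed release, so that call is an assumption and can be dropped.

for epoch in 1:1000
  @info "epoch" epoch
  train!(loss, params(network), data, optimizer)
  GC.gc()               # finalize any unreachable tracked arrays / CuArrays
  # CuArrays.reclaim()  # assumed API: hand cached device memory back to the driver
end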