
External allocations fail under high memory pressure #340

@denizyuret

Description


This is a memory stability issue I was not able to crack during training of a largish model written by @mg9. To replicate, download test.zip, unzip it, run julia in the resulting directory, and include("train.jl"). The errors appear within 1-2 minutes. There are two problems as far as I can see:

The first problem is one we have known about for a while (see e.g. denizyuret/Knet.jl#556): CUDA.jl grabs almost all GPU memory for its pool, and CUDNN ends up complaining when it tries to allocate off-pool. At least I think this is what's going on, because (1) it typically happens when CUDA.usage[] is close to the GPU memory limit during expensive CUDNN calls, and (2) it goes away if we use CUDA.usage_limit[]. Interestingly, it results in an EXECUTION_FAILED error rather than an out-of-memory error:

julia> include("train.jl")
...
iteration 29

Stacktrace:
 [1] throw_api_error(::CUDA.CUDNN.cudnnStatus_t) at /userfiles/dyuret/.julia/dev/CUDA/lib/cudnn/error.jl:19
 [2] macro expansion at /userfiles/dyuret/.julia/dev/CUDA/lib/cudnn/error.jl:30 [inlined]
 [3] cudnnRNNForwardTraining(::Ptr{Nothing}, ::Knet.RD, ::Int64, ::Knet.TDs, ::KnetArray{Float32,3}, ::Ptr{Nothing}, ::CUDA.CuPtr{Nothing}, ::Ptr{Nothing}, ::CUDA.CuPtr{Nothing}, ::Knet.FD, ::KnetArray{Float32,3}, ::Knet.TDs, ::KnetArray{Float32,3}, ::Knet.TD, ::KnetArray{Float32,3}, ::Knet.TD, ::KnetArray{Float32,3}, ::KnetArray{UInt8,1}, ::Int64, ::KnetArray{UInt8,1}, ::Int64) at /userfiles/dyuret/.julia/dev/CUDA/lib/utils/call.jl:93
...
ERROR: LoadError: CUDNNError: CUDNN_STATUS_EXECUTION_FAILED (code 8)
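One way to check the hypothesis would be to log the pool counter against the device totals right before the expensive CUDNN call. A minimal sketch (the placement inside the training loop is up to whoever reproduces this, and it assumes the CUDA.available_memory()/CUDA.total_memory() helpers; CUDA.usage[] is the pool counter mentioned above):

using CUDA
# If the hypothesis holds, failing iterations should show usage[] within
# a workspace-sized gap of the device total just before the RNN call.
@info "memory" pool_usage=CUDA.usage[] free=CUDA.available_memory() total=CUDA.total_memory()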

For this problem, if my hypothesis is correct, we just need to figure out what is trying to grab memory and whether there is a way to give it memory from the pool. If not, we just need to set usage_limit[] below some safety margin at initialization.
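A sketch of what that initialization could look like (the 2 GB headroom is a guess at CUDNN's off-pool needs, not a measured value, and it assumes the CUDA.total_memory() helper):

using CUDA
# Cap the pool below the physical limit so off-pool CUDNN allocations
# (workspaces, cuDNN-internal buffers) still have room to succeed.
CUDA.usage_limit[] = CUDA.total_memory() - 2 * 10^9  # 2 GB headroom is a guess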

Starting with a usage_limit[] indeed solves the first problem, but makes a second one appear, on which I could not make any progress:

julia> using CUDA; CUDA.usage_limit[] = 10^10  # For a 12GB K80 card
julia> include("train.jl")
...
iteration 68

Stacktrace:
 [1] throw_api_error(::CUDA.cudaError_enum) at /userfiles/dyuret/.julia/dev/CUDA/lib/cudadrv/error.jl:103
 [2] macro expansion at /userfiles/dyuret/.julia/dev/CUDA/lib/cudadrv/error.jl:110 [inlined]
 [3] cuMemcpyDtoH_v2(::Ptr{Float32}, ::CuPtr{Nothing}, ::Int64) at /userfiles/dyuret/.julia/dev/CUDA/lib/utils/call.jl:93
 [4] _unsafe_copy!(::Array{Float32,2}, ::Int64, ::KnetArray{Float32,2}, ::Int64, ::Int64) at /dev/shm/dyuret/.julia/packages/Knet/exwCE/src/karray.jl:359
 [5] convert at /dev/shm/dyuret/.julia/packages/Knet/exwCE/src/karray.jl:120 [inlined]
 [6] convert(::Type{Array{Int32,N} where N}, ::KnetArray{Float32,2}) at /dev/shm/dyuret/.julia/packages/Knet/exwCE/src/karray.jl:119
 [7] (::DeepBiaffineGraphDecoder)(::AutoGrad.Result{KnetArray{Float32,3}}, ::Array{Int64,2}, ::Array{Int64,2}, ::Array{Int64,2}) at /scratch/users/dyuret/mrp/test/deep_biaffine_graph_decoder.jl:92
...
ERROR: LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
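Note that code 700 is reported asynchronously: the faulting kernel may have launched well before the cuMemcpyDtoH that surfaces it, so the stack trace likely points at an innocent copy. Inserting explicit synchronization points can move the report closer to the actual fault; a debugging sketch (where to put the calls inside train.jl is of course a guess):

using CUDA
# After each suspect layer/operation: force pending kernels to finish so
# an asynchronous illegal-address fault is raised here, not at the next
# device-to-host copy.
CUDA.synchronize()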

Unfortunately this error does not always occur in the same iteration or in the same function, as is typical of memory errors. So I tried running julia under cuda-memcheck:

$ cuda-memcheck julia
julia> using CUDA; CUDA.usage_limit[] = 10^10  # For a 12GB K80 card
julia> include("train.jl")
...
iteration 2
========= Invalid __global__ read of size 4
...
Stacktrace:
 [1] throw_api_error(::CUDA.cudaError_enum) at /userfiles/dyuret/.julia/dev/CUDA/lib/cudadrv/error.jl:103
 [2] macro expansion at /userfiles/dyuret/.julia/dev/CUDA/lib/cudadrv/error.jl:110 [inlined]
 [3] cuMemAlloc_v2(::Base.RefValue{CuPtr{Nothing}}, ::Int64) at /userfiles/dyuret/.julia/dev/CUDA/lib/utils/call.jl:93
 [4] alloc at /userfiles/dyuret/.julia/dev/CUDA/lib/cudadrv/memory.jl:84 [inlined]
...
ERROR: LoadError: CUDA error: unspecified launch failure (code 719, ERROR_LAUNCH_FAILED)

This one I get fairly consistently with cuda-memcheck. Any ideas?
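For anyone trying to replicate, forcing synchronous kernel launches through the driver may also make the failing call deterministic without cuda-memcheck's overhead (CUDA_LAUNCH_BLOCKING is the stock CUDA environment variable):

$ CUDA_LAUNCH_BLOCKING=1 julia
julia> using CUDA; CUDA.usage_limit[] = 10^10  # same limit as above
julia> include("train.jl")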


Labels: bug (Something isn't working), cuda libraries (Stuff about CUDA library wrappers)
