Description
This is a memory stability issue I was not able to crack during training of a largish model written by @mg9. To replicate, download test.zip, unzip it, start julia in the resulting directory, and run include("train.jl"). The errors appear within 1-2 minutes. There are two problems as far as I can see:
The first problem is something we have known about for a while (see e.g. denizyuret/Knet.jl#556): CUDA.jl grabs almost all GPU memory, so CUDNN ends up complaining when it tries to allocate off-pool. At least I think this is what is going on, because (1) it typically happens when CUDA.usage[] is close to the GPU memory limit during expensive CUDNN calls, and (2) it goes away if we use CUDA.usage_limit[]. Interestingly, it results in an EXECUTION_FAILED error rather than an out-of-memory error:
julia> include("train.jl")
...
iteration 29
Stacktrace:
[1] throw_api_error(::CUDA.CUDNN.cudnnStatus_t) at /userfiles/dyuret/.julia/dev/CUDA/lib/cudnn/error.jl:19
[2] macro expansion at /userfiles/dyuret/.julia/dev/CUDA/lib/cudnn/error.jl:30 [inlined]
[3] cudnnRNNForwardTraining(::Ptr{Nothing}, ::Knet.RD, ::Int64, ::Knet.TDs, ::KnetArray{Float32,3}, ::Ptr{Nothing}, ::CUDA.CuPtr{Nothing}, ::Ptr{Nothing}, ::CUDA.CuPtr{Nothing}, ::Knet.FD, ::KnetArray{Float32,3}, ::Knet.TDs, ::KnetArray{Float32,3}, ::Knet.TD, ::KnetArray{Float32,3}, ::Knet.TD, ::KnetArray{Float32,3}, ::KnetArray{UInt8,1}, ::Int64, ::KnetArray{UInt8,1}, ::Int64) at /userfiles/dyuret/.julia/dev/CUDA/lib/utils/call.jl:93
...
ERROR: LoadError: CUDNNError: CUDNN_STATUS_EXECUTION_FAILED (code 8)
For this problem, if my hypothesis is correct, we just need to figure out what is trying to grab memory and whether there is a way to give it memory from the pool. If not, we just need to set usage_limit[] below some safety limit at initialization.
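If this hypothesis is right, a first-pass mitigation would be to cap the pool a bit below physical memory at initialization and log pool usage every iteration to confirm the correlation. A minimal sketch of that idea, using the free/total numbers from CUDA.Mem.info(); the 1 GiB safety margin is a guess, not a tested value:

using CUDA

free, total = CUDA.Mem.info()                    # physical free/total device memory in bytes
CUDA.usage_limit[] = Int(total) - (1 << 30)      # leave ~1 GiB outside the pool for CUDNN workspaces (margin is a guess)

# then, once per training iteration, to confirm the correlation with CUDA.usage[]:
@info "pool usage" usage=CUDA.usage[] limit=CUDA.usage_limit[]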
Starting with a usage_limit[] solves the first problem and exposes the second one, on which I could not make any progress:
julia> using CUDA; CUDA.usage_limit[] = 10^10 # For a 12GB K80 card
julia> include("train.jl")
...
iteration 68
Stacktrace:
[1] throw_api_error(::CUDA.cudaError_enum) at /userfiles/dyuret/.julia/dev/CUDA/lib/cudadrv/error.jl:103
[2] macro expansion at /userfiles/dyuret/.julia/dev/CUDA/lib/cudadrv/error.jl:110 [inlined]
[3] cuMemcpyDtoH_v2(::Ptr{Float32}, ::CuPtr{Nothing}, ::Int64) at /userfiles/dyuret/.julia/dev/CUDA/lib/utils/call.jl:93
[4] _unsafe_copy!(::Array{Float32,2}, ::Int64, ::KnetArray{Float32,2}, ::Int64, ::Int64) at /dev/shm/dyuret/.julia/packages/Knet/exwCE/src/karray.jl:359
[5] convert at /dev/shm/dyuret/.julia/packages/Knet/exwCE/src/karray.jl:120 [inlined]
[6] convert(::Type{Array{Int32,N} where N}, ::KnetArray{Float32,2}) at /dev/shm/dyuret/.julia/packages/Knet/exwCE/src/karray.jl:119
[7] (::DeepBiaffineGraphDecoder)(::AutoGrad.Result{KnetArray{Float32,3}}, ::Array{Int64,2}, ::Array{Int64,2}, ::Array{Int64,2}) at /scratch/users/dyuret/mrp/test/deep_biaffine_graph_decoder.jl:92
...
ERROR: LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Unfortunately this error does not always occur at the same iteration or in the same function, as is typical with memory errors. So I tried running julia under cuda-memcheck:
$ cuda-memcheck julia
julia> using CUDA; CUDA.usage_limit[] = 10^10 # For a 12GB K80 card
julia> include("train.jl")
...
iteration 2
========= Invalid __global__ read of size 4
...
Stacktrace:
[1] throw_api_error(::CUDA.cudaError_enum) at /userfiles/dyuret/.julia/dev/CUDA/lib/cudadrv/error.jl:103
[2] macro expansion at /userfiles/dyuret/.julia/dev/CUDA/lib/cudadrv/error.jl:110 [inlined]
[3] cuMemAlloc_v2(::Base.RefValue{CuPtr{Nothing}}, ::Int64) at /userfiles/dyuret/.julia/dev/CUDA/lib/utils/call.jl:93
[4] alloc at /userfiles/dyuret/.julia/dev/CUDA/lib/cudadrv/memory.jl:84 [inlined]
...
ERROR: LoadError: CUDA error: unspecified launch failure (code 719, ERROR_LAUNCH_FAILED)
This one I get fairly consistently with cuda-memcheck. Any ideas?
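Since the illegal access is reported asynchronously (it surfaces at whatever API call runs next, a cuMemcpyDtoH in one run and a cuMemAlloc in another), forcing a blocking synchronize after every step might at least pin the failure to the right iteration. A minimal sketch; train_step!, model and data are placeholders for whatever train.jl actually does per batch:

using CUDA

for (i, batch) in enumerate(data)          # `data` is a placeholder for the training iterator
    train_step!(model, batch)              # placeholder for the per-batch forward/backward/update
    CUDA.synchronize()                     # block until all queued kernels finish, so errors surface here
    println("iteration $i")
end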