eachobsparallel CUDA Error while freeing... #161

Closed · nikopj opened this issue Jul 13, 2023 · 4 comments · Fixed by JuliaGPU/CUDA.jl#2029

nikopj commented Jul 13, 2023

There seems to be a bug when using a parallel dataloader and transferring to GPU. It's a bit difficult to reproduce and not consistent across runs (because of multithreading, I suppose). It seems to involve heavy FileIO + CUDA in a for loop. I've narrowed it down to using eachobsparallel, and it appears to be a function of batchsize and the number of threads: if the batchsize is not sizeably larger than the number of threads (roughly 2x), the CUDA free error pops up within 1-3 passes over the data.

In my tests, the MWE (dl_test.jl, below) produces an error according to this table:

| nthreads | batchsize | executor   | result |
|----------|-----------|------------|--------|
| 2        | 1         | ThreadedEx | works  |
| 2        | 2         | ThreadedEx | works  |
| 2        | 4         | ThreadedEx | works  |
| 2        | 8         | ThreadedEx | works  |
| 4        | 1         | ThreadedEx | works  |
| 4        | 2         | ThreadedEx | works  |
| 4        | 4         | ThreadedEx | FAILS  |
| 4        | 8         | ThreadedEx | FAILS  |
| 4        | 16        | ThreadedEx | works  |
| 8        | 8         | ThreadedEx | FAILS  |
| 8        | 16        | ThreadedEx | works  |
| 2        | 1         | TaskPoolEx | works  |
| 2        | 2         | TaskPoolEx | works  |
| 2        | 4         | TaskPoolEx | works  |
| 2        | 8         | TaskPoolEx | works  |
| 4        | 1         | TaskPoolEx | works  |
| 4        | 2         | TaskPoolEx | works  |
| 4        | 4         | TaskPoolEx | works  |
| 4        | 8         | TaskPoolEx | works  |
| 4        | 16        | TaskPoolEx | works  |
| 8        | 8         | TaskPoolEx | works  |
| 8        | 16        | TaskPoolEx | works  |

This is on a 16-core CPU with 64 GB of memory.

`dl_test.jl`:

```julia
using MLUtils, CUDA

using FLoops
using FLoops.Transducers: ThreadedEx
using FoldsThreads: TaskPoolEx
import Base: length, getindex

BATCHSIZE = parse(Int, ARGS[1])

# Dummy Dataset
struct DummyDS
    num
end
function getindex(data::DummyDS, idx::Int)
    return randn(Float32, 128, 128, 3)
end
length(data::DummyDS) = data.num

ds = MLUtils.BatchView(DummyDS(5000); batchsize=BATCHSIZE, partial=false, collate=true)
dl = MLUtils.eachobsparallel(ds; executor=ThreadedEx())

# `randn_like` comes from MLUtils
function dummyloss(x)
    y = randn_like(x)
    return sum(abs, x - y)
end

function data_loop(loader)
    loss = 0
    for x in loader
        x = cu(x)                # transfer batch to GPU
        loss += dummyloss(x)
        CUDA.unsafe_free!(x)     # eagerly free the device buffer
    end
    return nothing
end

for i = 1:20
    @time data_loop(dl)
end
```

Here's the accompanying error (for example, when I run `julia --project -t 8 dl_test.jl 8`); the same error repeats many times over.

```
WARNING: Error while freeing DeviceBuffer(1.500 MiB at 0x000014ca65000000):
UndefRefError()

Stacktrace:
  [1] current_device
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/lib/cudadrv/devices.jl:24 [inlined]
  [2] #_free#998
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:485 [inlined]
  [3] _free
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:479 [inlined]
  [4] macro expansion
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:464 [inlined]
  [5] macro expansion
    @ ./timing.jl:393 [inlined]
  [6] #free#997
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:463 [inlined]
  [7] free
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:452 [inlined]
  [8] (::CUDA.var"#1004#1005"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuStream})()
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:130
  [9] #context!#887
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:170 [inlined]
 [10] context!
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:165 [inlined]
 [11] unsafe_free!(xs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:129
 [12] unsafe_finalize!(xs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:150
 [13] Array
    @ ./boot.jl:477 [inlined]
 [14] getindex
    @ ./array.jl:400 [inlined]
 [15] show_datatype
    @ ./show.jl:1058 [inlined]
 [16] _show_type(io::IOContext{IOBuffer}, x::Type)
    @ Base ./show.jl:958
 [17] show(io::IOContext{IOBuffer}, x::Type)
    @ Base ./show.jl:950
 [18] show_typeparams(io::IOContext{IOBuffer}, env::Core.SimpleVector, orig::Core.SimpleVector, wheres::Vector{TypeVar})
    @ Base ./show.jl:707
 [19] show_datatype(io::IOContext{IOBuffer}, x::DataType, wheres::Vector{TypeVar})
    @ Base ./show.jl:1092
--- the last 5 lines are repeated 4 more times ---
 [40] show_datatype
    @ ./show.jl:1058 [inlined]
 [41] _show_type(io::IOContext{IOBuffer}, x::Type)
    @ Base ./show.jl:958
 [42] show(io::IOContext{IOBuffer}, x::Type)
    @ Base ./show.jl:950
 [43] print(io::IOContext{IOBuffer}, x::Type)
    @ Base ./strings/io.jl:35
 [44] print(::IOContext{IOBuffer}, ::String, ::Type, ::Vararg{Any})
    @ Base ./strings/io.jl:46
 [45] #with_output_color#962
    @ ./util.jl:76
 [46] printstyled(::IOContext{Core.CoreSTDOUT}, ::String, ::Vararg{Any}; bold::Bool, underline::Bool, blink::Bool, reverse::Bool, hidden::Bool, color::Symbol)
    @ Base ./util.jl:130
 [47] #print_within_stacktrace#538
    @ ./show.jl:2435
 [48] print_within_stacktrace
    @ ./show.jl:2433 [inlined]
 [49] show_signature_function
    @ ./show.jl:2427
 [50] #show_tuple_as_call#539
    @ ./show.jl:2459
 [51] show_tuple_as_call
    @ ./show.jl:2441 [inlined]
 [52] show_spec_linfo
    @ ./stacktraces.jl:244
 [53] print_stackframe
    @ ./errorshow.jl:730
 [54] print_stackframe
    @ ./errorshow.jl:695
 [55] #show_full_backtrace#921
    @ ./errorshow.jl:594
 [56] show_full_backtrace
    @ ./errorshow.jl:587 [inlined]
 [57] show_backtrace
    @ ./errorshow.jl:791
 [58] #free#997
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:473 [inlined]
 [59] free
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:452 [inlined]
 [60] (::CUDA.var"#1004#1005"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuStream})()
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:130
 [61] #context!#887
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:170 [inlined]
 [62] context!
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:165 [inlined]
 [63] unsafe_free!(xs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:129
 [64] unsafe_finalize!(xs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:150
 [65] Array
    @ ./boot.jl:489 [inlined]
 [66] similar
    @ ./array.jl:374 [inlined]
 [67] similar
    @ ./abstractarray.jl:838 [inlined]
 [68] _typed_stack(::Colon, ::Type{Float32}, ::Type{Array{Float32, 3}}, A::Vector{Array{Float32, 3}}, Aax::Tuple{Base.OneTo{Int64}})
    @ Base ./abstractarray.jl:2797
 [69] _typed_stack
    @ ./abstractarray.jl:2793 [inlined]
 [70] _stack
    @ ./abstractarray.jl:2783 [inlined]
 [71] _stack
    @ ./abstractarray.jl:2775 [inlined]
 [72] #stack#178
    @ ./abstractarray.jl:2743 [inlined]
 [73] stack
    @ ./abstractarray.jl:2743 [inlined]
 [74] batch
    @ /scratch/npj226/.julia/dev/MLUtils/src/utils.jl:367 [inlined]
 [75] _getbatch(A::BatchView{Array{Float32, 4}, DummyDS, Val{true}}, obsindices::UnitRange{Int64})
    @ MLUtils /scratch/npj226/.julia/dev/MLUtils/src/batchview.jl:138
 [76] getindex
    @ /scratch/npj226/.julia/dev/MLUtils/src/batchview.jl:129 [inlined]
 [77] getobs(::Type{SimpleTraits.Not{MLUtils.IsTable{BatchView{Array{Float32, 4}, DummyDS, Val{true}}}}}, data::BatchView{Array{Float32, 4}, DummyDS, Val{true}}, idx::Int64)
    @ MLUtils /scratch/npj226/.julia/dev/MLUtils/src/observation.jl:110
 [78] getobs
    @ /scratch/npj226/.julia/packages/SimpleTraits/l1ZsK/src/SimpleTraits.jl:331 [inlined]
 [79] (::MLUtils.var"#58#59"{BatchView{Array{Float32, 4}, DummyDS, Val{true}}})(ch::Channel{Any}, i::Int64)
    @ MLUtils /scratch/npj226/.julia/dev/MLUtils/src/parallel.jl:66
 [80] macro expansion
    @ /scratch/npj226/.julia/dev/MLUtils/src/parallel.jl:124 [inlined]
 [81] ##reducing_function#293#68
    @ /scratch/npj226/.julia/packages/FLoops/6PVny/src/reduce.jl:817 [inlined]
 [82] (::InitialValues.AdjoinIdentity{MLUtils.var"##reducing_function#293#68"{MLUtils.Loader, Channel{Any}}})(x::Tuple{}, y::Int64)
    @ InitialValues /scratch/npj226/.julia/packages/InitialValues/OWP8V/src/InitialValues.jl:306
 [83] next
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/combinators.jl:290 [inlined]
 [84] next
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/core.jl:289 [inlined]
 [85] macro expansion
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/core.jl:181 [inlined]
 [86] macro expansion
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:199 [inlined]
 [87] macro expansion
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/simd.jl:41 [inlined]
 [88] _foldl_linear_bulk
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:198 [inlined]
 [89] macro expansion
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:192 [inlined]
 [90] macro expansion
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/basics.jl:115 [inlined]
 [91] _foldl_array
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:188 [inlined]
 [92] __foldl__
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:182 [inlined]
 [93] foldl_basecase
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:365 [inlined]
 [94] _reduce_basecase(rf::Transducers.BottomRF{Transducers.AdHocRF{MLUtils.var"##oninit_function#292#67", typeof(identity), InitialValues.AdjoinIdentity{MLUtils.var"##reducing_function#293#68"{MLUtils.Loader, Channel{Any}}}, typeof(identity), typeof(identity), MLUtils.var"##combine_function#294#69"}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, reducible::Transducers.SizedReducible{UnitRange{Int64}, Int64})
    @ Transducers /scratch/npj226/.julia/packages/Transducers/yTXrD/src/threading_utils.jl:58
 [95] _reduce(ctx::Transducers.NoopDACContext, rf::Transducers.BottomRF{Transducers.AdHocRF{MLUtils.var"##oninit_function#292#67", typeof(identity), InitialValues.AdjoinIdentity{MLUtils.var"##reducing_function#293#68"{MLUtils.Loader, Channel{Any}}}, typeof(identity), typeof(identity), MLUtils.var"##combine_function#294#69"}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, reducible::Transducers.SizedReducible{UnitRange{Int64}, Int64})
    @ Transducers /scratch/npj226/.julia/packages/Transducers/yTXrD/src/reduce.jl:150
```

Here's the output of `CUDA.versioninfo()` for reference:

```
CUDA runtime 11.8, artifact installation
CUDA driver 11.8
NVIDIA driver 520.61.5

CUDA libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+520.61.5

Julia packages:
- CUDA: 4.4.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0

Toolchain:
- Julia: 1.9.2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

Environment:
- JULIA_CUDA_MEMORY_POOL: none

1 device:
  0: Quadro RTX 8000 (sm_75, 44.485 GiB / 45.000 GiB available)
```

And the package versions I'm using (`] status`):

```
  [052768ef] CUDA v4.4.0
  [cc61a311] FLoops v0.2.1
  [9c68100b] FoldsThreads v0.1.2
  [f1d291b0] MLUtils v0.4.3
```
nikopj changed the title from "Dataloader parallel CUDA Error while freeing..." to "eachobsparallel CUDA Error while freeing..." on Jul 14, 2023

nikopj commented Jul 14, 2023

Note that the hanging issue described in #142 is still present with TaskPoolEx, but at least it runs!
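For anyone hitting this in the meantime, swapping the executor is the workaround (a minimal sketch reusing the `ds` BatchView from `dl_test.jl` above):

```julia
using MLUtils
using FoldsThreads: TaskPoolEx

# Same DummyDS/BatchView setup as in dl_test.jl; only the executor changes.
dl = MLUtils.eachobsparallel(ds; executor=TaskPoolEx())
```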

ToucheSir (Contributor) commented

I was very confused at first, but it appears the actual error is masked by the catch block handling in https://github.com/JuliaGPU/CUDA.jl/blob/v4.4.0/src/pool.jl#L472-L474, which itself errors when trying to print the stacktrace. Can you change that to `rethrow()` instead, or remove the catch block entirely, to see what the root error is?
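Roughly, the suggested change looks like this (a paraphrased sketch, not the actual CUDA.jl source; `buf` and `actual_free` are stand-in names, see the link above for the real code):

```julia
# Paraphrase of the free path in CUDA.jl src/pool.jl (v4.4.0):
try
    actual_free(buf)
catch ex
    # Instead of catching and printing the error + backtrace (which itself
    # throws here and masks the root cause), surface the original error:
    rethrow()
end
```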

ToucheSir (Contributor) commented

I did a good deal more digging on this, and after asking around it seems to be an issue on the CUDA.jl side. Will update this issue with more details as I get them.

nikopj commented Oct 26, 2023

This appears to be fixed on my end now with the upgraded CUDA version!

nikopj closed this as completed on Oct 26, 2023