TaskPoolEx leads to unreliable DataLoaders (#142)
Comments
Also, the comment claims that thread pools are not used anyway, so there seems to be some confusion one way or the other. Perhaps related to #33? (See lines 72 to 76 in ff2fcc1.)
That comment must be outdated 😅 Before moving back to
What have been your experiences, if any, running GPU training with
I would assume this is easily fixed by running e.g.
Note also what it says in the
Regarding the GPU utilization question, my mental model tells me this:
Either way, benchmarking it would probably be a good idea, but I would expect different results depending on the batch size, data type, GPU model, etc.
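A rough benchmark along the lines suggested above (a sketch, not from the thread; it assumes BenchmarkTools is installed and that iterating the loader dominates the timing — the per-batch work is a stand-in):

```julia
using MLUtils, BenchmarkTools

X = rand(Float32, 128, 10_000)
loader = DataLoader(X; batchsize=64, parallel=true)

# Time one full pass over the data. As noted above, expect different
# results depending on batch size, data type, GPU model, etc.
@btime for batch in $loader
    sum(batch)  # stand-in for per-batch work such as a GPU transfer
end
```

Repeating this with `parallel=false` (or with the loader rebuilt under each executor) would give the comparison the comment is asking for.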
I think I encountered this same problem. Attempting to use DataLoader(...; parallel = true) in some cases will result in it hanging. Reverting to ThreadedEx fixes the problem. Modifying the example in the docs results in this MWE:

```julia
using MLUtils
Xtrain = rand(10, 100);
array_loader = DataLoader(Xtrain; batchsize=2, parallel=true);
first(array_loader) # <--- this will hang forever
```

A smaller data size does work for me:

```julia
using MLUtils
Xtrain = rand(10, 50);
array_loader = DataLoader(Xtrain; batchsize=2, parallel=true);
first(array_loader)
```

Stacktrace from interrupting the hanging DataLoader:

```julia-repl
julia> first(array_loader)
^CERROR: InterruptException:
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:871
  [2] wait()
    @ Base ./task.jl:931
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:124
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:416
  [5] take!(c::Channel{Any})
    @ Base ./channels.jl:410
  [6] iterate(#unused#::MLUtils.Loader, state::MLUtils.LoaderState)
    @ MLUtils ~/.julia/packages/MLUtils/R44Zf/src/parallel.jl:140
  [7] iterate(loader::MLUtils.Loader)
    @ MLUtils ~/.julia/packages/MLUtils/R44Zf/src/parallel.jl:132
  [8] iterate(e::DataLoader{Matrix{Float64}, Random._GLOBAL_RNG, Val{nothing}})
    @ MLUtils ~/.julia/packages/MLUtils/R44Zf/src/eachobs.jl:173
  [9] first(itr::DataLoader{Matrix{Float64}, Random._GLOBAL_RNG, Val{nothing}})
    @ Base ./abstractarray.jl:424
 [10] top-level scope
    @ REPL[5]:1
```
I have had no problems with GPU training and ThreadedEx. Generally for long-running tasks I'll SSH in and start training inside a tmux session, and then leave and let it run. SSHing back into the machine while training is running has been fine. This has been my experience playing with Kaggle datasets on FastAI.jl and Flux.
How friendly is stopping a DataLoader that uses ThreadedEx via Ctrl+C? I think that would be a good test to run: create a dataset which doesn't require any network or disk access to load data and see if running a dataloader over it can be interrupted.
After `dev MLUtils` with ThreadedEx:

```julia-repl
julia> using MLUtils

julia> Xtrain = rand(10, 100);

julia> Ytrain = rand('a':'z', 100);

julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true, parallel=true);

julia> for epoch in 1:10000
           for (x, y) in train_loader # access via tuple destructuring
               @assert size(x) == (10, 5)
               @assert size(y) == (5,)
               # loss += f(x, y) # etc, runs 100 * 20 times
           end
       end
^CERROR: InterruptException:
Stacktrace:
 [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
   @ Base ./task.jl:871
 [2] wait()
   @ Base ./task.jl:931
 [3] wait(c::Base.GenericCondition{ReentrantLock})
   @ Base ./condition.jl:124
 [4] take_buffered(c::Channel{Any})
   @ Base ./channels.jl:416
 [5] take!(c::Channel{Any})
   @ Base ./channels.jl:410
 [6] iterate(#unused#::MLUtils.Loader, state::MLUtils.LoaderState)
   @ MLUtils ~/.julia/dev/MLUtils/src/parallel.jl:143
 [7] iterate(#unused#::DataLoader{NamedTuple{(:data, :label), Tuple{Matrix{Float64}, Vector{Char}}}, Random._GLOBAL_RNG, Val{nothing}}, ::Tuple{MLUtils.Loader, MLUtils.LoaderState})
   @ MLUtils ~/.julia/dev/MLUtils/src/eachobs.jl:179
 [8] top-level scope
   @ ./REPL[17]:6

julia> for epoch in 1:10000
           for (x, y) in train_loader # access via tuple destructuring
               @assert size(x) == (10, 5)
               @assert size(y) == (5,)
               # loss += f(x, y) # etc, runs 100 * 20 times
           end
       end

julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=10, shuffle=true, parallel=true); # Redefined DataLoader

julia> for epoch in 1:10000
           for (x, y) in train_loader # access via tuple destructuring
               @assert size(x) == (10, 5)
               @assert size(y) == (5,)
               # loss += f(x, y) # etc, runs 100 * 20 times
           end
       end
ERROR: AssertionError: size(x) == (10, 5) # <--- expected with batchsize=10 in redefined DataLoader
Stacktrace:
 [1] top-level scope
   @ ./REPL[19]:3
```

Does the above example meet the criteria you intended to test? Interrupting the DataLoader with ThreadedEx seems okay.
Usually, the situation is that I start a model training but notice after a few epochs that the loss isn't behaving as I want it to, I notice a bug in the code, etc., and so I interrupt the training with Ctrl+C. This led me to the situation where I cannot restart training without restarting the REPL, which takes several minutes for my application.
Good to know. I'll let @lorenzoh make the final call about whether to switch over.
Sorry for my late reply! I think with all the issues cropping up it seems like a good idea to revert to |
When a program is interrupted (e.g. by any error) while a DataLoader is being read, the DataLoader cannot be read from anymore, and the Julia session needs to be restarted (which can be very annoying). I've dug into the details and found that the reason is that the default parallel executor is set to
So it seems that the TaskPoolEx is globally defined (a sort of singleton?) and may hang, such that we can't execute in parallel anymore...
Here's an MWE.
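(The original snippet was not preserved in this copy of the issue; it is presumably along the lines of the reproduction posted elsewhere in the thread, sketched here:)

```julia
using MLUtils
Xtrain = rand(10, 100);
array_loader = DataLoader(Xtrain; batchsize=2, parallel=true);
first(array_loader)  # hangs once the global task pool is wedged
```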
I therefore recommend that the default executor stays ThreadedEx, and a user may choose to try the TaskPoolEx if they want to speed up performance even more. However, I would also expect thread pools to not be that helpful, since usually dataloaders are not created at such a high rate.
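The recommended split can be sketched with the Folds.jl API directly (a sketch under assumptions: FoldsThreads is installed, and the exact mechanism for passing an executor to DataLoader may differ by MLUtils version):

```julia
using Folds
using FoldsThreads: TaskPoolEx

xs = rand(1000)

# ThreadedEx is Folds' default executor: each call spawns fresh tasks,
# so an interrupted computation leaves no shared state behind.
s1 = Folds.sum(xs)

# TaskPoolEx reuses a global pool of long-lived tasks. Spin-up is cheaper,
# but the pool is shared state, so a task wedged by an interrupt can hang
# every later parallel call — the failure mode described in this issue.
s2 = Folds.sum(xs, TaskPoolEx())
```

This is why ThreadedEx as the default is the safer choice: the opt-in TaskPoolEx trades robustness under interruption for lower task-creation overhead, which matters little when dataloaders are created infrequently.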