
TaskPoolEx leads to unreliable Dataloaders #142

Closed

RomeoV opened this issue Feb 8, 2023 · 11 comments · Fixed by #143

Comments

RomeoV (Contributor) commented Feb 8, 2023

When a program is interrupted (e.g. by any error) while a DataLoader is being read, the DataLoader cannot be read from anymore, and the Julia session needs to be restarted (which can be very annoying).
I've dug into the details and found that the reason is that the default parallel executor is set to

_default_executor() = TaskPoolEx(basesize=1, background=Threads.nthreads() > 1)

So it seems that the TaskPoolEx is globally defined (a sort of singleton?) and may hang, such that we can't execute in parallel anymore...

Here's an MWE:

# launch with `julia --threads auto`
using FastAI
using FastAI.Flux
using FastVision
using FastAI.Flux.MLUtils
using FastAI.MLUtils.Transducers: ThreadedEx

path = load(datasets()["CUB_200_2011"])
data = FastAI.Datasets.loadfolderdata(
    path,
    filterfn = FastVision.isimagefile,
    loadfn = loadfile,
)
dl = DataLoader(data; parallel=true)
collect(dl);  # <--- interrupt this with Ctrl-C
first(dl)     # <--- this will wait forever now

dl = DataLoader(data; parallel=true)  # <--- even redefining doesn't help
first(dl)     # <--- this will wait forever now

# however we can fix it!
import FastAI.Flux.MLUtils._default_executor
_default_executor() = ThreadedEx()
dl = DataLoader(data; parallel=true)
first(dl)  # now it works again

I therefore recommend that the default executor be ThreadedEx, and that users who want to squeeze out more performance can opt into TaskPoolEx themselves. However, I would also expect task pools not to be that helpful here, since dataloaders are usually not created at a high rate.
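
For illustration, opting back into TaskPoolEx could reuse the override trick from the MWE above (a sketch, not a supported API; `_default_executor` is an internal function):

# Sketch: opting into TaskPoolEx by overriding the internal
# `_default_executor`, mirroring the workaround in the MWE above.
using MLUtils
using MLUtils.Transducers: TaskPoolEx
import MLUtils: _default_executor

# Use a task pool on background threads when more than one thread is available.
_default_executor() = TaskPoolEx(basesize=1, background=Threads.nthreads() > 1)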

RomeoV (Contributor, Author) commented Feb 8, 2023

Also, the comment claims that task pools are not used anyway, so there seems to be some confusion one way or the other. Perhaps related: #33?

# Unlike DataLoaders.jl, this currently does not use task pools
# since `ThreadedEx` has shown to be more performant. This may
# change in the future.
# See PR 33 https://github.com/JuliaML/MLUtils.jl/pull/33
_default_executor() = TaskPoolEx(basesize=1, background=Threads.nthreads() > 1)

RomeoV added a commit to RomeoV/MLUtils.jl that referenced this issue Feb 8, 2023
lorenzoh (Contributor) commented Feb 8, 2023

That comment must be outdated 😅
The issue with ThreadedEx was that it would utilize every thread, including the primary one, leading, amongst other things, to an SSH connection freezing because all of the CPU was being used by Julia.

Before moving back to ThreadedEx, we would have to ascertain that

  • running data loading doesn't freeze the rest of the machine
  • occupying the primary thread doesn't hurt GPU utilization during training

What have been your experiences, if any, running GPU training with ThreadedEx?

RomeoV (Contributor, Author) commented Feb 9, 2023

I would assume this is easily fixed by running e.g. `julia --threads 14` if your machine has 16 threads, no?
Also, in cases where you want to set it and forget it, the fact that it can utilize that much CPU actually seems like a good sign to me.

RomeoV (Contributor, Author) commented Feb 9, 2023

Note also what it says in the TaskPoolEx documentation:

Warning:
It is highly discouraged to use this executor in Julia packages; especially those that are used as libraries rather than end-user applications. This is because the whole purpose of this executor is to prevent Julia runtime from doing the right thing for managing tasks. Ideally, the library user should be able to pass an executor as an argument so that your library function can be used with any executors including TaskPoolEx.
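
The executor-injection pattern the warning describes might look like the following hypothetical sketch (`mybatchmap` is an invented name, not MLUtils or Transducers API):

using FLoops                    # `@floop` accepts an executor as its first argument
using Transducers: ThreadedEx

# Hypothetical library function: the caller injects the executor
# (including TaskPoolEx, if they want it) instead of the library
# hard-coding one internally.
function mybatchmap(f, xs; executor = ThreadedEx())
    ys = Vector{Any}(undef, length(xs))
    @floop executor for i in eachindex(xs)
        ys[i] = f(xs[i])   # each index is written exactly once, so this is race-free
    end
    return ys
end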

RomeoV (Contributor, Author) commented Feb 9, 2023

Regarding the GPU utilization question, my mental model tells me this:

  1. The data preprocessing is faster than the GPU processing.
    In that case, the CPU threads sit idle, waiting to push their results into a channel, and are free to do any data movement while they wait.
  2. The data preprocessing is slower than the GPU processing.
    In that case, the CPU should be running with all resources.

Either way, benchmarking it would probably be a good idea, but I would expect different results depending on the batch size, data type, GPU model, etc.
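
As a toy illustration of the backpressure in case 1 (not MLUtils internals, just a bounded Channel from Base, with sleep standing in for the GPU step):

# A bounded Channel provides backpressure: the producer blocks on `put!`
# once the buffer is full and sits idle until the consumer catches up.
ch = Channel{Int}(4) do ch        # buffer holds 4 "batches"
    for i in 1:100
        put!(ch, i)               # blocks while the buffer is full
    end
end

for batch in ch
    sleep(0.01)                   # stand-in for the (slower) GPU step
    # ... train on `batch` ...
end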

jochalek commented Mar 8, 2023

I think I encountered this same problem. Attempting to use DataLoader(...; parallel=true) will in some cases result in it hanging. Reverting to ThreadedEx fixes the problem.

Modifying the example in the docs results in this MWE:

using MLUtils

Xtrain = rand(10, 100);
array_loader = DataLoader(Xtrain; batchsize=2, parallel=true);
first(array_loader)     # <--- this will hang forever.

A smaller data size does work for me:

using MLUtils

Xtrain = rand(10, 50);
array_loader = DataLoader(Xtrain; batchsize=2, parallel=true);
first(array_loader) 

Stacktrace from interrupting the hanging DataLoader:

julia> first(array_loader)
^CERROR: InterruptException:
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:871
  [2] wait()
    @ Base ./task.jl:931
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base ./condition.jl:124
  [4] take_buffered(c::Channel{Any})
    @ Base ./channels.jl:416
  [5] take!(c::Channel{Any})
    @ Base ./channels.jl:410
  [6] iterate(#unused#::MLUtils.Loader, state::MLUtils.LoaderState)
    @ MLUtils ~/.julia/packages/MLUtils/R44Zf/src/parallel.jl:140
  [7] iterate(loader::MLUtils.Loader)
    @ MLUtils ~/.julia/packages/MLUtils/R44Zf/src/parallel.jl:132
  [8] iterate(e::DataLoader{Matrix{Float64}, Random._GLOBAL_RNG, Val{nothing}})
    @ MLUtils ~/.julia/packages/MLUtils/R44Zf/src/eachobs.jl:173
  [9] first(itr::DataLoader{Matrix{Float64}, Random._GLOBAL_RNG, Val{nothing}})
    @ Base ./abstractarray.jl:424
 [10] top-level scope
    @ REPL[5]:1

What have been your experiences, if any, running GPU training with ThreadedEx?

I have had no problems with GPU training and ThreadedEx. Generally, for long-running tasks I'll SSH in and start training inside a tmux session, and then leave and let it run. SSHing back into the machine while training is running has been fine. This has been my experience playing with Kaggle datasets on FastAI.jl and Flux.

ToucheSir (Contributor) commented

How friendly is stopping a DataLoader that uses ThreadedEx with Ctrl+C? I think that would be a good test to run: create a dataset which doesn't require any network or disk access to load data, and see if a dataloader running over it can be interrupted.
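
A minimal version of that test might look like this sketch (the in-memory array means no disk or network access is involved):

using MLUtils

data = rand(Float32, 10, 10_000)   # purely in-memory dataset, no I/O
dl = DataLoader(data; batchsize=32, parallel=true)
for batch in dl                    # press Ctrl-C somewhere mid-iteration...
end
first(dl)                          # ...then check the loader can still be read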

jochalek commented Mar 9, 2023

After `dev MLUtils` with ThreadedEx:

julia> using MLUtils

julia> Xtrain = rand(10, 100);

julia> Ytrain = rand('a':'z', 100);

julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=5, shuffle=true, parallel=true);

julia> for epoch in 1:10000
                for (x, y) in train_loader  # access via tuple destructuring
                  @assert size(x) == (10, 5)
                  @assert size(y) == (5,)
                  # loss += f(x, y) # etc, runs 100 * 20 times
                end
              end
^CERROR: InterruptException:
Stacktrace:
 [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
   @ Base ./task.jl:871
 [2] wait()
   @ Base ./task.jl:931
 [3] wait(c::Base.GenericCondition{ReentrantLock})
   @ Base ./condition.jl:124
 [4] take_buffered(c::Channel{Any})
   @ Base ./channels.jl:416
 [5] take!(c::Channel{Any})
   @ Base ./channels.jl:410
 [6] iterate(#unused#::MLUtils.Loader, state::MLUtils.LoaderState)
   @ MLUtils ~/.julia/dev/MLUtils/src/parallel.jl:143
 [7] iterate(#unused#::DataLoader{NamedTuple{(:data, :label), Tuple{Matrix{Float64}, Vector{Char}}}, Random._GLOBAL_RNG, Val{nothing}}, ::Tuple{MLUtils.Loader, MLUtils.LoaderState})
   @ MLUtils ~/.julia/dev/MLUtils/src/eachobs.jl:179
 [8] top-level scope
   @ ./REPL[17]:6

julia> for epoch in 1:10000
                for (x, y) in train_loader  # access via tuple destructuring
                  @assert size(x) == (10, 5)
                  @assert size(y) == (5,)
                  # loss += f(x, y) # etc, runs 100 * 20 times
                end
              end

julia> train_loader = DataLoader((data=Xtrain, label=Ytrain), batchsize=10, shuffle=true, parallel=true); # Redefined DataLoader

julia> for epoch in 1:10000
                for (x, y) in train_loader  # access via tuple destructuring
                  @assert size(x) == (10, 5)
                  @assert size(y) == (5,)
                  # loss += f(x, y) # etc, runs 100 * 20 times
                end
              end
ERROR: AssertionError: size(x) == (10, 5)   # <--- expected with batchsize=10 in redefined DataLoader
Stacktrace:
 [1] top-level scope
   @ ./REPL[19]:3

julia>

Does the above example meet the criteria you intended to test? Interrupting a DataLoader with ThreadedEx seems okay.

RomeoV (Contributor, Author) commented Mar 9, 2023

How friendly is stopping a DataLoader that uses ThreadedEx with Ctrl+C? I think that would be a good test to run: create a dataset which doesn't require any network or disk access to load data, and see if a dataloader running over it can be interrupted.

Usually, the situation is that I start a model training, but notice after a few epochs that the loss isn't behaving as I want it to, I notice a bug in the code, etc., and so I interrupt the training with Ctrl+C. This led me to the situation where I cannot restart training without restarting the REPL, which takes several minutes for my application.

ToucheSir (Contributor) commented

Good to know. I'll let @lorenzoh make the final call about whether to switch over.

lorenzoh (Contributor) commented

Sorry for my late reply! I think with all the issues cropping up it seems like a good idea to revert to ThreadedEx. I'll leave some additional comments on the PR. Thanks everyone for the discussion!
