
Implement parallel data iterator using FLoops.jl #33

Merged · 13 commits · Mar 1, 2022

Conversation

@lorenzoh (Contributor) commented on Feb 2, 2022

This is a proof-of-concept for a parallel data iterator matching the behavior of DataLoaders.jl, but implemented using FLoops.jl and FoldsThreads.jl. Closes #30

Why use FLoops.jl for this and not just copy from DataLoaders.jl?

  • The parallelism code in DataLoaders.jl is quite opaque (I would know) and not exactly minimal, to the point where I've been very hesitant to make changes.
  • DataLoaders.jl has had a longstanding issue with brittle interrupt handling where, sometimes, the session would hang and need to be restarted. AFAICT this issue does not exist with this PR's implementation.
  • FoldsThreads.jl provides many different executors, making it possible to adapt to the specific workload.
  • Whereas DataLoaders.jl always uses either N or N-1 threads, FLoops.jl also allows using fewer workers via basesize (see the sketch after this list).
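
To illustrate the last two points, here is a minimal sketch (not code from this PR) of how FLoops.jl lets the executor and the task granularity be swapped without touching the loop body:

using FLoops
using FoldsThreads: TaskPoolEx
using Transducers: ThreadedEx

# The same @floop body runs under any executor; `basesize` controls how
# many items each task handles.
function sum_lengths(items, ex)
    @floop ex for x in items
        @reduce(total = 0 + length(x))
    end
    return total
end

items = [rand(10) for _ in 1:1000]
sum_lengths(items, ThreadedEx(basesize = 100))  # ~10 tasks for 1000 items
sum_lengths(items, TaskPoolEx())                # pool-based scheduling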


Benchmarks vs. DataLoaders.jl

I have done some throughput tests on synthetic, uniform workloads and this PR seems to perform pretty well.

import DataLoaders
import LearnBase
import MLUtils
using Transducers: ThreadedEx

# Synthetic dataset: `n` observations, each taking `t` seconds to "load".
struct TimeDataset
    n::Int
    t::Float64
end

# Implement both the LearnBase interface (used by DataLoaders.jl) and the
# MLUtils interface so the same dataset works with both iterators.
LearnBase.getobs(data::TimeDataset, idx) = (sleep(data.t); return idx)
LearnBase.nobs(data::TimeDataset) = data.n
MLUtils.getobs(data::TimeDataset, idx) = (sleep(data.t); return idx)
MLUtils.numobs(data::TimeDataset) = data.n

# Four workloads, from few slow observations to many fast ones:
data = TimeDataset(100, 0.1)
@elapsed for _ in MLUtils.eachobsparallel(data, ThreadedEx()) end
@elapsed for _ in DataLoaders.eachobsparallel(data) end

data = TimeDataset(10000, 0.001)
@elapsed for _ in MLUtils.eachobsparallel(data, ThreadedEx()) end
@elapsed for _ in DataLoaders.eachobsparallel(data) end

data = TimeDataset(10000, 0.0001)
@elapsed for _ in MLUtils.eachobsparallel(data, ThreadedEx()) end
@elapsed for _ in DataLoaders.eachobsparallel(data) end

data = TimeDataset(10000, 0.00001)
@elapsed for _ in MLUtils.eachobsparallel(data, ThreadedEx()) end
@elapsed for _ in DataLoaders.eachobsparallel(data) end

[Figure: benchmark results comparing MLUtils.eachobsparallel and DataLoaders.eachobsparallel across the four workloads]

That said, I still need to test how this translates to real-world performance by trying it on some FastAI.jl workloads, though I assume the overhead of either approach will be negligible there.

@lorenzoh (Contributor, author) commented on Feb 2, 2022

So, I did some performance measurements on FastAI.jl workloads on imagenette2-320, specifically:

  • loading all 13000 images
  • encoding a single image 13000 times
  • loading and encoding all 13000 images

The setup was something like this:

using FastAI
import DataLoaders
import MLUtils
using ProgressBars: tqdm  # assumed source of `tqdm`

data, blocks = loaddataset("imagenette2-320", (Image, Label))

# EXECUTOR below is a placeholder for the executor under test, e.g. ThreadedEx().
@time for obs in tqdm(DataLoaders.eachobsparallel(data, useprimary=true)) end
@time for obs in tqdm(MLUtils.eachobsparallel(data, EXECUTOR)) end

Tested on my machine with 12 physical cores and -t 12. I also eyeballed the CPU utilization of the process.
The results are very consistent across the 3 workloads:

  • DataLoaders.jl and MLUtils.eachobsparallel(data, ThreadedEx(basesize=1), buffersize=48) are equally fast and don't leave much room for improvement, with CPU utilization around 1150-1200%
  • all other executors perform worse than DataLoaders.jl, with peak utilization of 800-900%.

This is good news though, showing that the much simpler Folds-based implementation could be a usable replacement for DataLoaders.jl.

@lorenzoh (Contributor, author) commented on Feb 2, 2022

The remaining question is how the new implementation affects the training loop: DataLoaders.jl allows useprimary = false to keep the main thread free (and gets 1050-1100% utilization). TaskPoolEx can also do this with background = true, but the benchmarks above show that executor to be much slower (1.3x) than ThreadedEx, which does not support the option; see the sketch below.
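
For reference, a minimal sketch of the two executor options discussed here (using FoldsThreads.jl's TaskPoolEx and its background keyword):

using FoldsThreads: TaskPoolEx
using Transducers: ThreadedEx

ex_fast = ThreadedEx(basesize = 1)       # fastest in the benchmarks above, but occupies the primary thread
ex_bg   = TaskPoolEx(background = true)  # keeps the primary thread free, ~1.3x slower here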

@CarloLucibello (Member) commented on Feb 3, 2022

So nice to have such a minimal implementation!

> DataLoaders allows useprimary = false to keep the main thread free (and gets 1050-1100% util.). TaskPoolEx can also do this with background = true, however the benchmarks above show this executor to be much slower (1.3x) than ThreadedEx, which does not support the argument.

Maybe we could file a feature request to Transducers.jl? Or we could kindly ask @tkf for suggestions.

@tkf commented on Feb 4, 2022

TaskPoolEx is very bare-bones and I've never tried to seriously optimize it. So, I suspect there is several low-hanging fruit. That said, I don't think I have time to try optimizing it ATM, at least until next month. But please feel free to file an issue in https://github.com/JuliaFolds/FoldsThreads.jl

BTW, since you are calling put! in the loop body and doing some I/O, maybe playing with basesize is useful. The idea is to create more than nthreads tasks so that you can execute the floop body while the data consumer is doing something else. It'd be something like ThreadedEx(basesize = cld(n, 8 * nthreads())) where 8 is the "over-subscription" parameter. But, if you are already seeing CPUs maxed out, maybe it doesn't matter. Also, currently, TaskPoolEx doesn't support this use case (it's straightforward to add it, though).
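
Concretely, the over-subscription suggestion amounts to something like this (a sketch; n is the number of observations):

using Base.Threads: nthreads
using Transducers: ThreadedEx

n = 10_000            # number of observations (hypothetical)
oversubscription = 8  # tasks per thread
ex = ThreadedEx(basesize = cld(n, oversubscription * nthreads()))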

@CarloLucibello (Member)

@tkf would adding the background=true option to ThreadedEx instead be easy to do?

@tkf commented on Feb 4, 2022

Unfortunately not; background=true requires an approach similar to TaskPoolEx.

@lorenzoh (Contributor, author) commented on Feb 4, 2022

Thanks for the comments, Takafumi. Playing with basesize unfortunately didn't change much for TaskPoolEx, and basesize = 1 already maxes out ThreadedEx, as you say. Maybe I can have a look at TaskPoolEx; after all, it should be doing the same thing DataLoaders.jl currently does.

@lorenzoh (Contributor, author)

I went ahead and added the buffered version of the parallel data loader, so this closes #30. I can't request a review for some reason, but this is ready for review.

I have an idea for a wrapper that ensures ordering of the returned observations (probably at some performance cost); a rough sketch is below, but that will be in another PR.
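
A hypothetical sketch of that idea (not part of this PR): workers emit (index, observation) pairs, and the consumer buffers out-of-order results until the next expected index arrives.

# Assumes `ch` yields (index, observation) pairs in arbitrary order.
function ordered(ch::Channel)
    return Channel() do out
        buffer = Dict{Int,Any}()
        next = 1
        for (i, obs) in ch
            buffer[i] = obs
            while haskey(buffer, next)
                put!(out, pop!(buffer, next))
                next += 1
            end
        end
    end
end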

The last thing to add from DataLoaders.jl is collating and the collated batch view, i.e. #29.

@codecov-commenter commented on Feb 24, 2022

Codecov Report

Merging #33 (19ddaa5) into main (fee6771) will increase coverage by 0.71%.
The diff coverage is 94.91%.


@@            Coverage Diff             @@
##             main      #33      +/-   ##
==========================================
+ Coverage   89.18%   89.89%   +0.71%     
==========================================
  Files          13       14       +1     
  Lines         416      475      +59     
==========================================
+ Hits          371      427      +56     
- Misses         45       48       +3     
Impacted Files     Coverage          Δ
src/parallel.jl    94.91% <94.91%>   (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@darsnack (Member)

Since we are using ThreadedEx here, what's the final story for keeping the main training thread free?

@lorenzoh (Contributor, author)

> Since we are using ThreadedEx here, what's the final story for keeping the main training thread free?

Still needs more investigation, but I think that's best tackled in a follow-up PR. My theory is that unless the data pipeline is the bottleneck anyway, the threads will prefetch enough observations so the primary thread can do its thing.

Will plug this into the FluxTraining.jl profiler to investigate. If this is a problem, we'll have to look into how to speed up TaskPoolEx.
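
To make the prefetching theory concrete, a minimal sketch (an assumption about the mechanism, not the PR's actual implementation): a bounded Channel of size buffersize decouples loading from training, so the training loop only blocks when the buffer runs dry.

import MLUtils

function prefetched(data; buffersize = 48)
    # A worker task fills the buffer ahead of time while the consumer trains.
    return Channel(buffersize; spawn = true) do ch
        for i in 1:MLUtils.numobs(data)
            put!(ch, MLUtils.getobs(data, i))
        end
    end
end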

Review thread on src/MLUtils.jl (outdated diff):
@@ -30,6 +33,9 @@ export batchsize,
include("eachobs.jl")
export eachobs

include("parallel.jl")
export eachobsparallel
A Member commented:

we should avoid exporting this until we settle on an interface (which could also be eachobs(..., parallel=true))

The author replied:

Will remove 👍

Successfully merging this pull request may close: Port parallel loaders from DataLoaders.jl