
Implement parallel data iterator using FLoops.jl #33

Merged · 13 commits · Mar 1, 2022

Conversation

@lorenzoh (Contributor) commented on Feb 2, 2022

This is a proof-of-concept for a parallel data iterator matching the behavior of DataLoaders.jl, but implemented using FLoops.jl and FoldsThreads.jl. Closes #30

Why use FLoops.jl for this and not just copy from DataLoaders.jl?

  • The parallelism code in DataLoaders.jl is quite opaque (I would know) and not exactly minimal, to the point where I've been very hesitant to make changes.
  • DataLoaders.jl has had a longstanding issue with brittle interrupt handling where, sometimes, the session would hang and need to be restarted. AFAICT this issue does not exist with this PR's implementation.
  • FoldsThreads.jl provides many different executors, making it possible to adapt to the specific workload.
  • Whereas DataLoaders.jl always uses either N or N-1 threads, FLoops.jl also allows using fewer workers via basesize (see the sketch after this list).
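
To illustrate the last two points, here is a minimal sketch (not code from this PR) of how FLoops.jl lets the executor and the task granularity be swapped without touching the loop body:

using FLoops
using FoldsThreads: TaskPoolEx
using Transducers: ThreadedEx

# The same @floop body runs under any executor; `basesize` controls how
# many items each task handles.
function sum_lengths(items, ex)
    @floop ex for x in items
        @reduce(total = 0 + length(x))
    end
    return total
end

items = [rand(10) for _ in 1:1000]
sum_lengths(items, ThreadedEx(basesize = 100))  # ~10 tasks for 1000 items
sum_lengths(items, TaskPoolEx())                # pool-based scheduling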


Benchmarks vs. DataLoaders.jl

I have done some throughput tests on synthetic, uniform workloads and this PR seems to perform pretty well.

import DataLoaders
import LearnBase
import MLUtils
using Transducers: ThreadedEx

# Synthetic dataset: `n` observations, each taking `t` seconds to "load".
struct TimeDataset
    n::Int
    t::Float64
end

# Implement both the LearnBase interface (used by DataLoaders.jl) and the
# MLUtils interface so the same dataset works with both iterators.
LearnBase.getobs(data::TimeDataset, idx) = (sleep(data.t); return idx)
LearnBase.nobs(data::TimeDataset) = data.n
MLUtils.getobs(data::TimeDataset, idx) = (sleep(data.t); return idx)
MLUtils.numobs(data::TimeDataset) = data.n

# Four workloads, from few slow observations to many fast ones:
data = TimeDataset(100, 0.1)
@elapsed for _ in MLUtils.eachobsparallel(data, ThreadedEx()) end
@elapsed for _ in DataLoaders.eachobsparallel(data) end

data = TimeDataset(10000, 0.001)
@elapsed for _ in MLUtils.eachobsparallel(data, ThreadedEx()) end
@elapsed for _ in DataLoaders.eachobsparallel(data) end

data = TimeDataset(10000, 0.0001)
@elapsed for _ in MLUtils.eachobsparallel(data, ThreadedEx()) end
@elapsed for _ in DataLoaders.eachobsparallel(data) end

data = TimeDataset(10000, 0.00001)
@elapsed for _ in MLUtils.eachobsparallel(data, ThreadedEx()) end
@elapsed for _ in DataLoaders.eachobsparallel(data) end

[Figure: benchmark results comparing MLUtils.eachobsparallel and DataLoaders.eachobsparallel across the four workloads]

That said, I still need to test how this translates to real-world performance by trying it on some FastAI.jl workloads, though I assume the overhead of either approach will be negligible there.

@lorenzoh (Contributor, author) commented on Feb 2, 2022

So, I did some performance measurements on FastAI.jl workloads on imagenette2-320, specifically:

  • loading all 13000 images
  • encoding a single image 13000 times
  • loading and encoding all 13000 images

The setup was something like this:

using FastAI
import DataLoaders
import MLUtils
using ProgressBars: tqdm  # assumed source of `tqdm`

data, blocks = loaddataset("imagenette2-320", (Image, Label))

# EXECUTOR below is a placeholder for the executor under test, e.g. ThreadedEx().
@time for obs in tqdm(DataLoaders.eachobsparallel(data, useprimary=true)) end
@time for obs in tqdm(MLUtils.eachobsparallel(data, EXECUTOR)) end

Tested on my machine with 12 physical cores and -t 12. I also eyeballed the CPU utilization of the process.
The results are very consistent across the 3 workloads:

  • DataLoaders.jl and MLUtils.eachobsparallel(data, ThreadedEx(basesize=1), buffersize=48) are equally fast and don't leave much room for improvement, with CPU utilization around 1150-1200%
  • all other executors perform worse than DataLoaders.jl, with peak utilization of 800-900%.

This is good news though, showing that the much simpler Folds-based implementation could be a usable replacement for DataLoaders.jl.

@lorenzoh (Contributor, author) commented on Feb 2, 2022

The remaining question is how the new implementation affects the training loop: DataLoaders.jl allows useprimary = false to keep the main thread free (and gets 1050-1100% utilization). TaskPoolEx can also do this with background = true, but the benchmarks above show that executor to be much slower (1.3x) than ThreadedEx, which does not support the option; see the sketch below.
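
For reference, a minimal sketch of the two executor options discussed here (using FoldsThreads.jl's TaskPoolEx and its background keyword):

using FoldsThreads: TaskPoolEx
using Transducers: ThreadedEx

ex_fast = ThreadedEx(basesize = 1)       # fastest in the benchmarks above, but occupies the primary thread
ex_bg   = TaskPoolEx(background = true)  # keeps the primary thread free, ~1.3x slower here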

@CarloLucibello (Member) commented on Feb 3, 2022

So nice to have such a minimal implementation!

> DataLoaders allows useprimary = false to keep the main thread free (and gets 1050-1100% util.). TaskPoolEx can also do this with background = true, however the benchmarks above show this executor to be much slower (1.3x) than ThreadedEx, which does not support the argument.

Maybe we could file a feature request to Transducers.jl? Or we could kindly ask @tkf for suggestions.

@tkf commented on Feb 4, 2022

TaskPoolEx is very bare-bones and I've never tried to seriously optimize it. So, I suspect there is several low-hanging fruit. That said, I don't think I have time to try optimizing it ATM, at least until next month. But please feel free to file an issue in https://github.com/JuliaFolds/FoldsThreads.jl

BTW, since you are calling put! in the loop body and doing some I/O, maybe playing with basesize is useful. The idea is to create more than nthreads tasks so that you can execute the floop body while the data consumer is doing something else. It'd be something like ThreadedEx(basesize = cld(n, 8 * nthreads())) where 8 is the "over-subscription" parameter. But, if you are already seeing CPUs maxed out, maybe it doesn't matter. Also, currently, TaskPoolEx doesn't support this use case (it's straightforward to add it, though).
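
Concretely, the over-subscription suggestion amounts to something like this (a sketch; n is the number of observations):

using Base.Threads: nthreads
using Transducers: ThreadedEx

n = 10_000            # number of observations (hypothetical)
oversubscription = 8  # tasks per thread
ex = ThreadedEx(basesize = cld(n, oversubscription * nthreads()))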

@CarloLucibello (Member)

@tkf would adding the background=true option to ThreadedEx instead be easy to do?

@tkf commented on Feb 4, 2022

Unfortunately not; background=true requires an approach similar to TaskPoolEx.

@lorenzoh (Contributor, author) commented on Feb 4, 2022

Thanks for the comments, Takafumi. Playing with basesize unfortunately didn't change much for TaskPoolEx, and basesize = 1 already maxes out ThreadedEx, as you say. Maybe I can have a look at TaskPoolEx; after all, it should be doing the same thing DataLoaders.jl currently does.

@lorenzoh (Contributor, author)

I went ahead and added the buffered version of the parallel data loader, so this closes #30. I can't request a review for some reason, but this is ready for review.

I have an idea for a wrapper that ensures ordering of the returned observations (probably at some performance cost); a rough sketch is below, but that will be in another PR.
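
A hypothetical sketch of that idea (not part of this PR): workers emit (index, observation) pairs, and the consumer buffers out-of-order results until the next expected index arrives.

# Assumes `ch` yields (index, observation) pairs in arbitrary order.
function ordered(ch::Channel)
    return Channel() do out
        buffer = Dict{Int,Any}()
        next = 1
        for (i, obs) in ch
            buffer[i] = obs
            while haskey(buffer, next)
                put!(out, pop!(buffer, next))
                next += 1
            end
        end
    end
end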

The last thing to add from DataLoaders.jl is collating and the collated batch view, i.e. #29.

@codecov-commenter commented on Feb 24, 2022

Codecov Report

Merging #33 (19ddaa5) into main (fee6771) will increase coverage by 0.71%.
The diff coverage is 94.91%.


@@            Coverage Diff             @@
##             main      #33      +/-   ##
==========================================
+ Coverage   89.18%   89.89%   +0.71%     
==========================================
  Files          13       14       +1     
  Lines         416      475      +59     
==========================================
+ Hits          371      427      +56     
- Misses         45       48       +3     
Impacted Files     Coverage          Δ
src/parallel.jl    94.91% <94.91%>   (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@darsnack (Member)

Since we are using ThreadedEx here, what's the final story for keeping the main training thread free?

@lorenzoh (Contributor, author)

> Since we are using ThreadedEx here, what's the final story for keeping the main training thread free?

Still needs more investigation, but I think that's best tackled in a follow-up PR. My theory is that unless the data pipeline is the bottleneck anyway, the threads will prefetch enough observations so the primary thread can do its thing.

Will plug this into the FluxTraining.jl profiler to investigate. If this is a problem, we'll have to look into how to speed up TaskPoolEx.
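
To make the prefetching theory concrete, a minimal sketch (an assumption about the mechanism, not the PR's actual implementation): a bounded Channel of size buffersize decouples loading from training, so the training loop only blocks when the buffer runs dry.

import MLUtils

function prefetched(data; buffersize = 48)
    # A worker task fills the buffer ahead of time while the consumer trains.
    return Channel(buffersize; spawn = true) do ch
        for i in 1:MLUtils.numobs(data)
            put!(ch, MLUtils.getobs(data, i))
        end
    end
end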

Review thread on src/MLUtils.jl (outdated diff):
@@ -30,6 +33,9 @@ export batchsize,
include("eachobs.jl")
export eachobs

include("parallel.jl")
export eachobsparallel
A Member commented:

we should avoid exporting this until we settle on an interface (which could also be eachobs(..., parallel=true))

The author replied:

Will remove 👍

Successfully merging this pull request may close: Port parallel loaders from DataLoaders.jl