Implement parallel data iterator using FLoops.jl #33
Conversation
So, I did some performance measurements on FastAI.jl workloads. The setup was something like this:

```julia
using FastAI
using ProgressBars: tqdm  # progress bar; assumed source of `tqdm`

data, blocks = loaddataset("imagenette2-320", (Image, Label))
@time for obs in tqdm(DataLoaders.eachobsparallel(data, useprimary=true)) end
@time for obs in tqdm(MLUtils.eachobsparallel(data, EXECUTOR)) end
```

Tested on my machine with 12 physical cores.
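`EXECUTOR` above is left as a placeholder; plausible candidates (assuming the executor types provided by FoldsThreads.jl) include:

```julia
using FoldsThreads

# Executor candidates for the EXECUTOR placeholder above; all of these
# come from FoldsThreads.jl and plug into Transducers.jl-based loops.
executor = WorkStealingEx()       # work-stealing scheduler, good for uneven workloads
# executor = TaskPoolEx()         # task pool; can be kept off the primary thread
# executor = NondeterministicEx() # allows out-of-order results for extra throughput
```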
This is good news though: it shows that the much simpler Folds-based implementation could be a usable replacement for DataLoaders.jl.
The remaining question is how the new implementation affects the training loop; DataLoaders.jl allows keeping the primary thread free (`useprimary=false`) so that data loading does not compete with the training step.
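For reference, that usage looks roughly like this (`useprimary` is the keyword from the benchmark above; the loop body is illustrative):

```julia
using DataLoaders

# With useprimary=false, worker tasks on threads 2..n load observations
# while thread 1 stays free to run the training step.
for obs in DataLoaders.eachobsparallel(data, useprimary = false)
    # training step runs here on the uncontended primary thread
end
```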
So nice to have such a minimal implementation!
Maybe we could file a feature request to Transducers.jl? Or we could kindly ask the maintainers directly.
BTW, since you are calling …
@tkf adding the …
Unfortunately not.
Thanks for the comments, Takafumi. Playing with …
I went ahead and added the buffered version of the parallel data loader, so this closes #30. I can't request a review for some reason, but this is ready for review. I have an idea for a wrapper that ensures ordering of the returned observations (probably at some performance hit; a rough sketch below), but that'll be in another PR. The last thing to add from DataLoaders.jl is collating and the collated batch view, i.e. #29.
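The ordering wrapper could work roughly like this (a sketch that assumes the parallel iterator yields `(index, observation)` pairs into a `Channel`; none of these names are final):

```julia
# Sketch: re-emit (index, observation) pairs from `ch` in index order,
# holding early arrivals in a Dict until their turn comes up.
function ordered(ch::Channel)
    return Channel() do out
        pending = Dict{Int,Any}()
        next = 1
        for (i, obs) in ch
            pending[i] = obs
            while haskey(pending, next)
                put!(out, pop!(pending, next))
                next += 1
            end
        end
    end
end
```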
Codecov Report

```diff
@@            Coverage Diff             @@
##             main      #33      +/-   ##
==========================================
+ Coverage   89.18%   89.89%   +0.71%
==========================================
  Files          13       14       +1
  Lines         416      475      +59
==========================================
+ Hits          371      427      +56
- Misses         45       48       +3
```
Since we are using …
Still needs more investigation, but I think that's best tackled in a follow-up PR. My theory is that unless the data pipeline is the bottleneck anyway, the threads will prefetch enough observations so the primary thread can do its thing. Will plug this into the FluxTraining.jl profiler to investigate. If this is a problem, we'll have to look into how to speed up …
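In the meantime, one quick way to check is a small timing harness like the following (hypothetical, not part of this PR): near-zero totals would mean prefetching keeps up with the consumer.

```julia
# Rough measure of how long the consumer blocks waiting for the next
# observation; the first iterate call is excluded as warm-up.
function totalwaittime(iter)
    total = 0.0
    next = iterate(iter)
    while next !== nothing
        _, state = next
        total += @elapsed next = iterate(iter, state)
    end
    return total
end
```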
src/MLUtils.jl (outdated)

```diff
@@ -30,6 +33,9 @@ export batchsize,
 include("eachobs.jl")
 export eachobs

+include("parallel.jl")
+export eachobsparallel
```
we should avoid exporting this until we settle on an interface (which could also be `eachobs(..., parallel=true)`)
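For illustration, the keyword-based variant could look something like this (purely a sketch; the serial fallback and the `ThreadedEx` default are assumptions, not the settled API):

```julia
using Transducers: ThreadedEx

# Hypothetical keyword-based entry point: keep the serial iterator as the
# default and only dispatch to the parallel one when asked for.
function eachobs(data; parallel = false, executor = ThreadedEx())
    parallel || return (getobs(data, i) for i in 1:numobs(data))  # serial path
    return eachobsparallel(data, executor)
end
```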
Will remove 👍
This is a proof-of-concept for a parallel data iterator matching the behavior of DataLoaders.jl, but implemented using FLoops.jl and FoldsThreads.jl. Closes #30.

Why use FLoops.jl for this and not just copy from DataLoaders.jl?
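To give a sense of why FLoops.jl keeps the implementation so compact, the core could be little more than the following (a sketch assuming MLUtils' `numobs`/`getobs`, not the actual code in `parallel.jl`):

```julia
using FLoops

# Sketch: load observations in parallel and hand them to the consumer
# through a Channel. `executor` is any Transducers.jl-compatible executor.
# Note: observations arrive in nondeterministic order.
function eachobsparallel_sketch(data, executor)
    return Channel(Threads.nthreads()) do ch
        @floop executor for i in 1:numobs(data)
            put!(ch, getobs(data, i))
        end
    end
end
```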
Current scope:
- `Loader`; just need to port DataLoaders.jl's `RingBuffer` for the buffered version (see the sketch after this list)
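For reference, the buffer-recycling idea behind DataLoaders.jl's `RingBuffer` can be sketched with two channels (all names here are hypothetical; the actual port may look different):

```julia
# `free` holds reusable buffers, `ready` holds buffers filled with data.
# The consumer hands each buffer back to `free`, so only length(buffers)
# allocations ever happen.
struct RingBufferSketch{T}
    free::Channel{T}
    ready::Channel{T}
end

function RingBufferSketch(buffers::Vector{T}) where {T}
    free = Channel{T}(length(buffers))
    ready = Channel{T}(length(buffers))
    foreach(b -> put!(free, b), buffers)
    return RingBufferSketch{T}(free, ready)
end

# Producer: take a free buffer, fill it in-place with `loadfn`, mark it ready.
produce!(rb::RingBufferSketch, loadfn) = put!(rb.ready, loadfn(take!(rb.free)))

# Consumer: take a ready buffer, process it with `f`, then recycle it.
function consume!(f, rb::RingBufferSketch)
    buf = take!(rb.ready)
    result = f(buf)
    put!(rb.free, buf)
    return result
end
```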
Benchmarks vs. DataLoaders.jl
I have done some throughput tests on synthetic, uniform workloads and this PR seems to perform pretty well.
That said, I still need to test how this translates to real-world performance on some FastAI.jl workloads, though I assume the overhead will be negligible for either approach.