add DataLoader #1051
Conversation
I very much like the Metalhead one too; hopefully having that would solve a lot of data management.
Xtrain = rand(10, 100)
Ytrain = rand(100)
dtrain = DataLoader(Xtrain, Ytrain, batchsize=2, shuffle=true)
for epoch in 1:100
Should use train!
I kept this example since I want to show that dtrain (also with shuffle=true) can be reused.
Added an example with train! below.
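For reference, a minimal sketch combining the two (reusing the loader across epochs, and driving it with train!). The model, loss, and optimizer here are hypothetical placeholders, not the code added to the docs:

```julia
using Flux
using Flux.Data: DataLoader

Xtrain = rand(10, 100)
Ytrain = rand(100)
dtrain = DataLoader(Xtrain, Ytrain, batchsize=2, shuffle=true)

m = Dense(10, 1)                        # hypothetical model
loss(x, y) = Flux.mse(vec(m(x)), y)     # hypothetical loss
opt = Descent(0.01)

# dtrain can be reused across epochs; with shuffle=true the batches are
# re-shuffled on every pass.
for epoch in 1:100
    Flux.train!(loss, params(m), dtrain, opt)
end
```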
export Iris

Random space
src/data/dataloader.jl
  end
end
"""
function DataLoader(xs...; batchsize=1, shuffle=false, partial=true)
Splatting here seems a bit odd, and shouldn't we make it an iterator/generator?
I couldn't come up with a generator-based solution that also supports shuffle and partial.
Splatting is there in order to support
DataLoader(X)
DataLoader(X, Y)
DataLoader(X, Y, Z)
....
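For illustration, here is roughly what the splatted constructor is meant to accept and yield. This is a sketch based on the examples in this thread; the single-array case reflects the behaviour settled on later in the discussion, where the batch is yielded directly rather than as a 1-tuple:

```julia
using Flux.Data: DataLoader

X = rand(10, 100)   # features; observations along the last dimension
Y = rand(100)       # labels
Z = rand(3, 100)    # any extra tensor sharing the observation dimension

# One tensor: each iteration yields a batch of X.
for x in DataLoader(X, batchsize=5)
    @assert size(x) == (10, 5)
end

# Several tensors: each iteration yields a tuple of matching batches.
for (x, y, z) in DataLoader(X, Y, Z, batchsize=5)
    @assert size(x) == (10, 5) && size(y) == (5,) && size(z) == (3, 5)
end
```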
Something like a sampler over the batch dimension could act as a surrogate, no?
Also, are X, Y and Z supposed to be minibatches? Similar to a splatted version of the data that we currently send to train!?
X and Y are the input and label tensors respectively; essentially, they will be partitioned and zipped together. The examples should be quite clear, but I could improve the docstring.
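To make the "partitioned and zipped" idea concrete, here is a rough sketch of the mechanism. It is not the PR's actual implementation; `getbatch` and `naive_batches` are made-up names:

```julia
# Slice a tensor along its last (observation) dimension.
getbatch(x::AbstractArray, idxs) = view(x, ntuple(_ -> Colon(), ndims(x) - 1)..., idxs)

# Partition the observation indices into batches and zip the slices of each tensor.
function naive_batches(xs...; batchsize=1)
    n = size(xs[1], ndims(xs[1]))                 # number of observations
    ranges = Iterators.partition(1:n, batchsize)
    return (map(x -> getbatch(x, r), xs) for r in ranges)
end

X, Y = rand(10, 100), rand(100)
for (x, y) in naive_batches(X, Y, batchsize=2)
    @assert size(x, 2) == length(y)
end
```

Shuffling and the partial-batch handling are what make the real implementation a bit more involved than this.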
Can we make this an abstract type so users can define their own dataloaders for things like video etc., à la PyTorch? Also, have you seen https://github.com/JuliaML/MLDataUtils.jl?
Isn't this the whole point of open source software? ;)
Please use whatever you'd like from Knet. Give credit in comments if you'd like.
For the time being, I don't see much need for an abstract interface, but we can revisit when the need arises.
Yes, but those repos have very low activity. Also, I prefer to have an in-house solution to tune around Flux-specific needs. Again, we can revisit in case JuliaML becomes more actively maintained.
Mhmh, for composability with
for (x,) in data
    ...
end
and I prefer
for x in data
    ...
end
Maybe we can modify
force-pushed from eb16d85 to 1b5197c
I ended up modifying it as suggested. This PR is now breaking, but I think only slightly so, since I special-cased just numeric-array datapoints in train! (see the diff below).
@MikeInnes this is very slightly breaking; I don't think we even need to bump Flux's major version. Are you ok with merging it?
try
  gs = gradient(ps) do
    loss(d...)
    if d isa AbstractArray{<:Number}
I worry this will cause problems, e.g. with an
Array{Union{Missing, Float64}} that contains no missing values.
It is pretty easy in general to end up with these, since they come out of DataFrames fairly readily.
This could be addressed by d isa Array{<:Union{Number,Missing}}, but I think that would be part of data preprocessing, same as converting image types to numeric types, so I'm not sure we should support it, since Flux models won't play nice with missings.
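A tiny illustration of the concern (not code from the PR): an eltype that merely allows Missing already falls outside AbstractArray{<:Number}, even if no value is actually missing:

```julia
d1 = rand(3, 4)                                           # Array{Float64,2}
d2 = convert(Array{Union{Missing,Float64}}, rand(3, 4))   # no missings, but Missing in the eltype

d1 isa AbstractArray{<:Number}           # true  -> would be treated as a single datapoint
d2 isa AbstractArray{<:Number}           # false -> would fall through to the splatting branch

# The widening suggested above:
d2 isa Array{<:Union{Number,Missing}}    # true
```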
backpropagation and calls the optimizer `opt`.
In case datapoints `d` are of numeric array type, assumes no splatting is needed
and computes the gradient of `loss(d)`.
I always find working out how train! is going to iterate my data and pass it to the loss confusing.
This seems like it is going to make it even more confusing.
Same; I would ditch train! entirely. I never use it in my code, plain loops are much more transparent and remove the need for callback gymnastics.
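For comparison, a plain loop over a DataLoader looks roughly like this (a sketch assuming `m`, `loss(x, y)`, `opt`, and `train_loader` as in the earlier examples):

```julia
using Flux
using Flux: params
using Flux.Optimise: update!

ps = params(m)
for epoch in 1:100
    for (x, y) in train_loader
        # compute the gradient of the loss for this minibatch...
        gs = gradient(ps) do
            loss(x, y)
        end
        # ...and apply the optimizer update in place
        update!(opt, ps, gs)
    end
end
```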
Why are we reimplementing MLDataPattern within Flux? Is it because we want to make it more batteries-included? If so, maybe we can just use it.
Is it because MLDataPattern is a bit crufty and old? If so, other good options include updating it and/or making a new competing package. One of the great things about Flux/Julia is that it doesn't need everything to be reimplemented just for it; it can use the whole ecosystem.
Here is an example of me using MLDataPattern/MLDataUtils with Flux. There are more examples here, though apparently the Flux example (unlike the others) is not using MLDataUtils (I think because I get confused by train!).
I agree with @oxinabox; MLDataUtils.jl provides a general implementation of data-related functionality.
I agree as well, reviving MLDataUtils.jl would be ideal. I'm not entirely familiar with the codebase, and it looks hard for me to contribute there: it has many dependencies, a complicated codebase, makes heavy use of custom types, and some parts are (or feel) outdated.

I think it is quite urgent to have in the Flux (or ML in general) ecosystem a simple data iterator that takes arrays as inputs for the constructor and supports

DataLoader(X, batchsize=128, shuffle=true)
DataLoader(X, Y, batchsize=128, partial=false)

Can we quickly cook something like that in MLDataUtils so that we can "sponsor" it in Flux? OneHotEncoding presented a similar case: instead of using the one in MLLabelUtils, people implemented an in-house solution here, probably because that was just simpler and more manageable. Again, I'm not utterly familiar with the JuliaML codebase, but it just feels very complicated. Maybe starting with a new repo, e.g. MLDataTools, would be the best path forward. Things could go as follows:

I would be in favor of starting a new repo with minimal functionality, just one-hot encoding and the dataloader, and then start bringing in functionality from MLDataUtils while keeping it lean and simple. @oxinabox, is this reasonable, or do you think revamping the JuliaML infrastructure is a better way forward?

In any case, if a functionally equivalent alternative to the DataLoader here does not materialize within a week or two, I would still be in favor of merging this PR and eventually deprecating it later when we have a valid alternative.
TL;DR: here's the more or less equivalent version written in MLDataPattern:

function DataLoader(data...; batchsize=1, shuffle=false, partial=true)
    count = partial ? ceil(Int, length(data[1])/batchsize) : floor(Int, length(data[1])/batchsize)
    data = shuffle ? shuffleobs(data) : data
    batchview(data, batchsize, count, ObsDim.Last())
end

The whole codebase there looks complicated at first glance because its design aims at more flexible and composable functionality. And I would say it's documented very well (so well that I don't even have a chance/reason to rebuild it with Documenter). What's needed IMO might be some tailored demos for Flux usage. As for adding a transform/augmentation hook, something like:

function DataLoader(data...; batchsize=1, shuffle=false, partial=true, transform=NoOp())
    count = partial ? ceil(Int, length(data[1])/batchsize) : floor(Int, length(data[1])/batchsize)
    data = shuffle ? shuffleobs(data) : data
    batchview(augment(data, transform), batchsize, count, ObsDim.Last())
end
This wouldn't reshuffle at every epoch, though.
You know, it's completely doable:

function DataLoader(data...; batchsize=1, shuffle=false, partial=true)
    count = partial ? ceil(Int, length(data[1])/batchsize) : floor(Int, length(data[1])/batchsize)
    iter = shuffle ? RandomBatches : BatchView
    iter(data, batchsize, count, ObsDim.Last())
end

So far I just think the DataLoader abstraction doesn't make things easier, except for users who stick to the PyTorch experience :P
Ok, getting there:
julia> X = reshape([1:10;], (2, 5))
2×5 Array{Int64,2}:
1 3 5 7 9
2 4 6 8 10
julia> function DataLoader(data...; batchsize=1, shuffle=false, partial=true)
s = size(data[1])[end]/batchsize
count = partial ? ceil(Int, s) : floor(Int, s)
@show count
iter = shuffle ? RandomBatches : BatchView
iter(data, batchsize, count, ObsDim.Last())
end
DataLoader (generic function with 1 method)
# this is OK
julia> DataLoader(X, batchsize=2, partial=false, shuffle=false) |> collect
count = 2
2-element Array{Tuple{SubArray{Int64,2,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64}},true}},1}:
([1 3; 2 4],)
([5 7; 6 8],)
# here what we want is sampling WITHOUT repetition
julia> DataLoader(X, batchsize=2, partial=false, shuffle=true) |> collect
count = 2
2-element Array{Tuple{SubArray{Int64,2,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false}},1}:
([1 3; 2 4],)
([5 3; 6 4],)
# ???
julia> DataLoader(X, batchsize=2, partial=true, shuffle=false) |> collect
count = 3
ERROR: ArgumentError: Specified number of partitions is too large for the specified size
Stacktrace:
[1] _compute_batch_settings(::Tuple{Array{Int64,2}}, ::Int64, ::Int64, ::LearnBase.ObsDim.Last, ::Bool) at /home/carlo/.julia/packages/MLDataPattern/mX21p/src/dataview.jl:197
[2] BatchView at /home/carlo/.julia/packages/MLDataPattern/mX21p/src/dataview.jl:359 [inlined] (repeats 2 times)
[3] #DataLoader#25(::Int64, ::Bool, ::Bool, ::typeof(DataLoader), ::Array{Int64,2}) at ./REPL[62]:6
[4] (::var"#kw##DataLoader")(::NamedTuple{(:batchsize, :partial, :shuffle),Tuple{Int64,Bool,Bool}}, ::typeof(DataLoader), ::Array{Int64,2}) at ./none:0
 [5] top-level scope at REPL[65]:1

What I've seen up to now is that the PyTorch DataLoader experience is smoother than anything else we have here.
I think a new repo is reasonable (I am a fan of the name MLDataUtils2, following the tradition of JSON2).
Now, these could be fixed in the current repo.
It seems you clearly outlined the current design problems of MLDataUtils, so yeah, let's start afresh! I don't like MLDataUtils2; maybe MLData or MLDataTools would be better, but I'm not strongly opposed either.
In addition, I think data loading should support multithreading.
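One possible shape for this (just a sketch, not part of this PR): run preprocessing on a separate task and hand finished batches to the training loop through a buffered Channel, so loading overlaps with training. `prefetch` and `expensive_transform` are hypothetical names:

```julia
function prefetch(loader; buffersize=4, preprocess=identity)
    ch = Channel{Any}(buffersize)
    task = Threads.@spawn begin
        for batch in loader
            put!(ch, preprocess(batch))   # e.g. decoding or augmentation
        end
        close(ch)
    end
    bind(ch, task)   # close the channel if the worker task fails
    return ch
end

# Hypothetical usage with an expensive per-batch transform:
# for (x, y) in prefetch(train_loader; preprocess=b -> expensive_transform(b))
#     ...
# end
```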
special case train! for the unsupervised data iterator
Ok, so I'm going to merge this and update the model zoo with examples more similar to real-world scenarios. As soon as we have something cooked in MLDataUtils2, we can remove the dataloader here.
bors r+
1051: add DataLoader r=CarloLucibello a=CarloLucibello
Co-authored-by: CarloLucibello <carlo.lucibello@gmail.com>
Build failed
CUDA out-of-memory error, let's try again:
bors r+
Build succeeded
1057: add Julia ecosystem doc section r=CarloLucibello a=CarloLucibello
Partially fixing #251, related to the discussion in #1051. Not exactly a poem that I wrote here; maybe someone could suggest a better rephrasing. Suggestions for additional packages to add to the list are also welcome.
Co-authored-by: CarloLucibello <carlo.lucibello@gmail.com>
It's not good that […]. There are two options here: either have […]. The former might be a good option insofar as it makes it easy to remove the […]
if d.shuffle && i == 0
    shuffle!(d.indices)
end
nexti = min(i + d.batchsize, d.nobs)
I think there is an issue in this line:
nexti = min(i + d.batchsize, d.nobs)
When (nexti - (i+1)) < d.batchsize, the batch returned will not be of size (..., batchsize) but (..., < batchsize).
I have just implemented another simple version, which pads the data with the very first samples in the last batch whenever its size is less than batchsize.
In my implementation, the user should give both x and y as input, because I think we should make this pairing a reminder to the user when constructing the loader. Please correct me if there is something wrong in my mind.
The implementation seems correct; see i >= d.imax && return nothing a few lines above. We do want to return a partial batch when partial == true. If you think it's wrong, produce an example of the wrong behavior.
When I run the following code
julia> x = rand(10,5); y = rand(5); loader = DataLoader(x, y, batchsize = 2)
julia> for i in 1:100
for (xx, yy) in loader
@assert size(xx) == (10, 2)
@assert size(yy) == (2,)
end
end
ERROR: AssertionError: size(xx) == (10, 2)
Stacktrace:
[1] top-level scope at ./REPL[31]:3
Test the situation size(xx) == (..., < batchsize)
julia> for i in 1:100
for (xx, yy) in loader
if size(xx) != (10, 2)
@assert size(xx) == (10, 1)
end
end
end
julia> When I appended a line like:
nexti = min(i + d.batchsize, d.nobs)
(nexti - i) == d.batchsize || return nothing

It works fine then. Sorry for not providing an example before.
Two things: 1) please use GitHub code blocks and try to make your code examples more readable; 2) read DataLoader's docstring more carefully and check what partial=true means.
Oh, my bad! The example has been reformatted. I have read the docstring, and to the best of my knowledge: when partial == true and the last minibatch is smaller than batchsize, it can still return a minibatch whose size is actually less than batchsize.
I think we could 1) add documentation that warns the user about this situation; and/or 2) add some assertion or something like that to make sure every size(minibatch) == batchsize.
> When I appended a line like:
> nexti = min(i + d.batchsize, d.nobs)
> (nexti - i) == d.batchsize || return nothing
> It works fine then.

I think doing this will ignore the effect of partial, and it is not a good solution here.
I really don't see the problem here; this is a feature, not a bug: if partial=true, return the last batch even if its size is less than batchsize; if partial=false, drop the last batch if it is not of the correct size. In either case, batches before the last one will be of size batchsize.
(btw, you can use ```julia code fences to get Julia syntax highlighting)
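A quick worked example of the two settings (assuming the single-array form yields the batch directly, as discussed earlier in the thread):

```julia
using Flux.Data: DataLoader

x = rand(10, 5)   # 5 observations; batchsize=2 does not divide evenly

for xx in DataLoader(x, batchsize=2)              # partial=true is the default
    @show size(xx)                                # (10, 2), (10, 2), then (10, 1)
end

for xx in DataLoader(x, batchsize=2, partial=false)
    @show size(xx)                                # (10, 2), (10, 2); the last batch is dropped
end
```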
If the problem is with partial=true being the default, it's because that's the standard practice in deep learning (e.g. see PyTorch's DataLoader).
Thank you for your kind note. Yes, it is not a bug but a feature.
I just noticed the example usage:
for epoch in 1:100
    for (x, y) in train_loader
@assert size(x) == (10, 2)
@assert size(y) == (2,)
...
end
endWhere the @assert seems to indicate the size of every minibatch returned must be batchsize. I have not dived deep in the pytorch's dataloader yet. Really sorry for that !
Maybe it's too late to chime in, but in a separate repository I've built a DataLoader that is more flexible than the one from Metalhead (i.e. it works with any data container) and, most importantly, uses multithreading to load data in parallel.
Perhaps it would be a good idea to benchmark it; I would love to see some optimised data loaders.
Sure! Though the speedup is about what you would expect, i.e. if you have 8 threads it's almost 8 times as fast, plus the added bonus that the data loading happens async, so the main training loop isn't interrupted -> no GPU starvation.
@lorenzoh what do you think about the MLDataUtils/LearnBase dependency? Would you consider adding it to this repo? I see a few options here, and for all of them dropping the MLDataUtils dependency could be considered:
Right! MLDataUtils.jl is not a necessary dependency; I just have it in there because it implements […]. If it makes sense to merge it into Flux.jl, sure. If there is interest in an improved […]
I don't think these types of utilities should go into the main repo. I recently worked with the dataloader and dataset APIs in MXNet, and they are efficient but inextensible. Think about it: why is your data a set, not a stream? PyTorch, MXNet and TensorFlow have the wrong data loading abstraction; they assume their users' data is a finite set that can be easily enumerated, but it is not. To efficiently load your data from different file stores or databases, you need specialized code; to efficiently transform your data, you need to utilize parallelism and pipelining; to efficiently feed data to your lovely models, you need another layer to stage data batches eagerly on host pinned memory and eagerly fetch them to the target device. They cannot fit into one paradigm easily. Now, the only library I think has the correct abstraction is tensorpack's DataFlow package. I'd suggest maintaining another utility repo that includes toy dataset downloaders and loaders. Otherwise, your main repo will eventually grow into a monster like TensorFlow or PyTorch.
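As a rough illustration of the "data as a stream" point (not a proposal for Flux itself), plain Julia iterators already compose into such a pipeline. `read_sample`, `augment`, and the `data/` directory are hypothetical:

```julia
# Source: an endless stream of file paths (could equally be a database cursor).
paths = Iterators.cycle(joinpath.("data", readdir("data")))

# Stages: decode, augment, then group into minibatches.
samples   = (read_sample(p) for p in paths)
augmented = (augment(s) for s in samples)
batches   = Iterators.partition(augmented, 32)

# The training loop just consumes the stream; an infinite stream has no
# "epochs", so stop explicitly.
for (i, batch) in enumerate(batches)
    i > 1000 && break
    # train on `batch` ...
end
```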
Fix #450
This adds a DataLoader type, largely adapted from the Knet one, therefore pinging @denizyuret to check if he is cool with this. Unfortunately, I cannot "unsee" his implementation, and in any case any reasonable alternative implementation will be pretty much similar I guess.
This is an initial implementation to get things going, possibly in the future we will also want a distributed and out-of-memory option as the one implemented by @staticfloat here
https://github.com/FluxML/Metalhead.jl/blob/sf/training/training/ImageNet/dataset.jl