
Conversation

@CarloLucibello
Member

Fix #450

This adds a DataLoader type, largely adapted from the Knet one, therefore I'm pinging @denizyuret to check if he is cool with this. Unfortunately, I cannot "unsee" his implementation, and in any case any reasonable alternative implementation would be pretty much similar, I guess.

This is an initial implementation to get things going, possibly in the future we will also want a distributed and out-of-memory option as the one implemented by @staticfloat here
https://github.com/FluxML/Metalhead.jl/blob/sf/training/training/ImageNet/dataset.jl

@DhairyaLGandhi
Member

I very much like the Metalhead one too; hopefully having that would solve a lot of data management issues.

Xtrain = rand(10, 100)
Ytrain = rand(100)
dtrain = DataLoader(Xtrain, Ytrain, batchsize=2, shuffle=true)
for epoch in 1:100
Member

Should use train!

Member Author

I kept this example since I want to show that dtrain (also with shuffle=true) can be reused.
Added an example with train! below.
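
(For context, a minimal sketch of what such a train! usage could look like with the loader above; the model and loss here are illustrative, not the docstring's:)

using Flux

model = Dense(10, 1)
loss(x, y) = Flux.mse(vec(model(x)), y)
Flux.train!(loss, Flux.params(model), dtrain, Descent(0.1))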

export Iris



Member

Random space

end
end
"""
function DataLoader(xs...; batchsize=1, shuffle=false, partial=true)
Member

Splatting here seems a bit odd, and shouldn't we make it an iterator/generator?

Member Author

I couldn't come up with a generator-based solution that also supports shuffle and partial.
Splatting is there in order to support

DataLoader(X)
DataLoader(X, Y)
DataLoader(X, Y, Z)
....

Member

Something like a sampler over the batch dimension could act as a surrogate, no?

Also, are the X, Y and Z supposed to be minibatches? Similar to a splatted version of the data that we currently send to train!?

Member Author

X and Y are the input and label tensors respectively; essentially, they will be partitioned and zipped together. The examples should be quite clear, but I could improve the docstring.
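
To make the partition-and-zip semantics concrete, here is a minimal plain-Julia sketch of what the loader does over the last dimension (illustrative code, not the PR's implementation):

X, Y = rand(10, 100), rand(100)
batches = [(X[:, idxs], Y[idxs]) for idxs in Iterators.partition(1:100, 2)]
first(batches)  # a (10×2 input, 2-element label) tuple, like a DataLoader batch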

@AriMKatz

AriMKatz commented Feb 26, 2020

Can we make this an abstract type so users can define their own dataloaders for things like video etc., à la PyTorch? Also, have you seen https://github.com/JuliaML/MLDataUtils.jl?

@denizyuret

denizyuret commented Feb 26, 2020 via email

@CarloLucibello
Member Author

Can we make this an abstract type so users can define their own dataloaders for things like video etc., à la PyTorch?

For the time being, I don't see much need for an abstract interface, but we can revisit when the need arises

Also have you seen https://github.com/JuliaML/MLDataUtils.jl?

Yes, but those repos have very low activity. Also, I prefer to have an in-house solution to tune around Flux-specific needs. Again, we can revisit in case JuliaML becomes more actively maintained.

@CarloLucibello
Member Author

CarloLucibello commented Feb 26, 2020

Mhmh, for composability with Flux.train!, DataLoader(X) should yield 1-element tuples (x,) during iteration, while now it just yields x. On the other hand, I'm not a huge fan of having to loop like this:

for (x,) in data
  ...
end 

and I prefer

for x in data
  ...
end 

Maybe we can modify train! to handle non-tuple arguments without splatting, although that would be breaking.

@CarloLucibello force-pushed the data branch 2 times, most recently from eb16d85 to 1b5197c on February 27, 2020 19:39
@CarloLucibello
Member Author

CarloLucibello commented Feb 27, 2020

I ended up modifying train! as follows: if the data iterator yields array minibatches d, then compute loss(d); otherwise compute loss(d...). In fact, it seems most natural to me that a data iterator for unsupervised learning would yield x instead of (x,).

This PR is now breaking, but I think only slightly so: since I special-cased just d isa AbstractArray{<:Number}, I don't think anyone will complain.
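
In pseudocode, the special-casing amounts to something like this (a sketch of the rule just described, not the exact diff; mytrain! is a stand-in name):

using Flux

function mytrain!(loss, ps, data, opt)
    for d in data
        gs = Flux.gradient(ps) do
            # numeric-array minibatches are passed whole; everything else is splatted
            d isa AbstractArray{<:Number} ? loss(d) : loss(d...)
        end
        Flux.Optimise.update!(opt, ps, gs)
    end
end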

@CarloLucibello
Member Author

@MikeInnes this is very slightly breaking, I don't think we even need to bump Flux's major version. Are you ok with merging it?

try
gs = gradient(ps) do
loss(d...)
if d isa AbstractArray{<:Number}
Member

I worry this will cause problems, e.g. with an Array{Union{Missing, Float64}} that contains no missing values.

It is pretty easy in general to end up with these, since they come out of DataFrames fairly readily.

Member Author
@CarloLucibello Feb 28, 2020

This could be addressed by d isa Array{<:Union{Number,Missing}}, but I think that would be part of data preprocessing, same as converting image types to numeric types, so I'm not sure we should support it, since Flux models won't play nice with missings.

backpropagation and calls the optimizer `opt`.
In case datapoints `d` are of numeric array type, assumes no splatting is needed
and computes the gradient of `loss(d)`.
Member

I always find working out how train! is going to iterate my data and pass it to the loss confusing.
This seems like it is going to make it even more confusing.

Member Author
@CarloLucibello Feb 28, 2020

Same, I would ditch train! entirely; I never use it in my code. Plain loops are much more transparent and remove the need for callback gymnastics.

@oxinabox
Member

oxinabox commented Feb 28, 2020

Why are we reimplementing MLDataPattern within Flux?

Is it because we want to make it more batteries-included? If so, maybe we can just reexport it?

Is it because MLDataPattern is a bit crufty and old? If so, other good options include updating it and/or making a new competing package.
I would be a big fan of either, as a maintainer of MLDataUtils/MLDataPattern.
It is mostly just mature and not needing much work these days (thus it looks kinda inactive), but I do think there is opportunity to clean up the API somewhat now that we have constant folding and specialization on kwargs.

One of the great things about Flux/Julia is that it doesn't need everything to be reimplemented just for it, and it can use the whole ecosystem.


Here is an example of me using MLDataPattern/MLDataUtils with Flux.
https://github.com/oxinabox/oxinabox.github.io/blob/master/_drafts/JuliaDeepLearningMeetupLondon2019/Regression%20Demo.ipynb
(and TensorFlow.jl)

There are more examples here, though apparently the Flux example (unlike the others) is not using MLDataUtils. (I think because I get confused by Flux.train!)
https://github.com/oxinabox/oxinabox.github.io/blob/master/_drafts/JuliaDeepLearningMeetupLondon2019/Classification%20Demo%20--%20Am%20I%20dressed%3F.ipynb
Here is one more with TensorFlow.jl, just for fun
https://github.com/oxinabox/oxinabox.github.io/blob/master/_drafts/JuliaDeepLearningMeetupLondon2019/Examples.ipynb

@GaiBrutman

I agree with @oxinabox, MLDataUtils.jl provides a general implementation of data-related functionality.
Any type that implements the Container interface can be batched using RandomBatches or eachbatch(), which is basically what the proposed DataLoader does.
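
For illustration, a rough equivalent of the examples above with eachbatch (a sketch, assuming arrays with observations along the last dimension):

using MLDataUtils

X, Y = rand(10, 100), rand(100)
for (x, y) in eachbatch((X, Y), size = 2)
    # x is a 10×2 view and y a 2-element view, much like a DataLoader batch
end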

@CarloLucibello
Member Author

CarloLucibello commented Feb 28, 2020

I agree as well, reviving MLDataUtils.jl would be ideal. I'm not entirely familiar with it, though, and it looks hard for me to contribute there: it has many dependencies, a complicated codebase, heavy use of custom types, and some parts that are (or feel) outdated.

I think it is quite urgent to have, in the Flux (or general ML) ecosystem, a simple data iterator that takes arrays as constructor inputs and supports batchsize, shuffle and partial:

DataLoader(X, batchsize=128, shuffle=true)
DataLoader(X, Y, batchsize=128,  partial=false)

Can we quickly cook something like that in MLDataUtils so that we can "sponsor" it in Flux?

OneHotEncoding presented a similar case: instead of using the one in MLLabelUtils, people implemented an in-house solution here, probably because that was just simpler and more manageable.

Again, I'm not terribly familiar with the JuliaML codebase, but it just feels very complicated. Maybe starting with a new repo (e.g. MLDataTools?) would be the best path forward. Things could go as follows:

  • Excise the datasets in Flux.Data from Flux. MLDatasets is a valid alternative, fully functional and does not depend on the rest of JuliaML ecosystem. This can be done immediately.
  • Excise OneHotEncoding and DataLoader (if this gets in) from Flux in favor of the new MLDataTools or a revamped MLDataUtils. This will take some thought and effort.

I would be in favor of starting a new repo with minimal functionality, just onehot encoding and dataloader, and then start bringing in functionality from MLDataUtils while keeping it lean and simple. @oxinabox Is this reasonable or you think revamping the JuliaML infrastructure is a better way forward?

In any case, if a functionally equivalent alternative to the DataLoader here does not materialize within a week or two, I would still be in favor of merging this PR and eventually deprecating it later, when we have a valid alternative.

@johnnychen94
Contributor

johnnychen94 commented Feb 28, 2020

TL;DR Here's the more or less equivalent version written in MLDataPattern

using MLDataPattern  # assumed import: shuffleobs, batchview and ObsDim come from MLDataPattern

function DataLoader(data...; batchsize=1, shuffle=false, partial=true)
    count = partial ? ceil(Int, length(data[1])/batchsize) : floor(Int, length(data[1])/batchsize)
    data = shuffle ? shuffleobs(data) : data
    batchview(data, batchsize, count, ObsDim.Last())
end

The whole codebase there looks complicated at first glance because its design aims at more flexible and composable functionality. And I would say it's documented very well (so well that I haven't even had a reason to rebuild it with Documenter). What's needed, IMO, might be some tailored demos for Flux usage.


As for the transform feature, since Augmentor is now compatible with Julia 1.x, it's also an easy task:

using Augmentor  # assumed import: NoOp and augment come from Augmentor

function DataLoader(data...; batchsize=1, shuffle=false, partial=true, transform=NoOp())
    count = partial ? ceil(Int, length(data[1])/batchsize) : floor(Int, length(data[1])/batchsize)
    data = shuffle ? shuffleobs(data) : data
    batchview(augment(data, transform), batchsize, count, ObsDim.Last())
end

@CarloLucibello
Member Author

TL;DR Here's the more or less equivalent version written in MLDataPattern

function DataLoader(data...; batchsize=1, shuffle=false, partial=true)
    count = partial ? ceil(Int, length(data[1])/batchsize) : floor(Int, length(data[1])/batchsize)
    data = shuffle ? shuffleobs(data) : data
    batchview(data, batchsize, count, ObsDim.Last())
end

This wouldn't reshuffle at every epoch, though.
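
(For reference, one way to get per-epoch reshuffling with MLDataPattern is to shuffle inside the epoch loop; a sketch, not the PR's code:)

using MLDataPattern

X, Y = rand(10, 100), rand(100)
for epoch in 1:10
    for (x, y) in batchview(shuffleobs((X, Y)), 2)
        # a fresh shuffled view each epoch, then fixed-size batches over it
    end
end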

@johnnychen94
Contributor

johnnychen94 commented Feb 28, 2020

this wouldn't reshuffle at every epoch though

You know, it's completely doable

function DataLoader(data...; batchsize=1, shuffle=false, partial=true)    
    count = partial ? ceil(Int, length(data[1])/batchsize) : floor(Int, length(data[1])/batchsize)
    iter = shuffle ? RandomBatches : BatchView
    iter(data, batchsize, count, ObsDim.Last())
end

So far I just think the DataLoader abstraction doesn't make things easier, except for users who stick to the PyTorch experience :P

@CarloLucibello
Member Author

CarloLucibello commented Feb 28, 2020

Ok, getting there

julia> X = reshape([1:10;], (2, 5))
2×5 Array{Int64,2}:
 1  3  5  7   9
 2  4  6  8  10

julia> function DataLoader(data...; batchsize=1, shuffle=false, partial=true)    
           s = size(data[1])[end]/batchsize
           count = partial ? ceil(Int, s) : floor(Int, s)
           @show count
           iter = shuffle ? RandomBatches : BatchView
           iter(data, batchsize, count, ObsDim.Last())
       end
DataLoader (generic function with 1 method)

# this is OK
julia> DataLoader(X, batchsize=2, partial=false, shuffle=false) |> collect
count = 2
2-element Array{Tuple{SubArray{Int64,2,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64}},true}},1}:
 ([1 3; 2 4],)
 ([5 7; 6 8],)

# here what we want is sampling WITHOUT repetition
julia> DataLoader(X, batchsize=2, partial=false, shuffle=true) |> collect
count = 2
2-element Array{Tuple{SubArray{Int64,2,Array{Int64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1}},false}},1}:
 ([1 3; 2 4],)
 ([5 3; 6 4],)

# ???
julia> DataLoader(X, batchsize=2, partial=true, shuffle=false) |> collect
count = 3
ERROR: ArgumentError: Specified number of partitions is too large for the specified size
Stacktrace:
 [1] _compute_batch_settings(::Tuple{Array{Int64,2}}, ::Int64, ::Int64, ::LearnBase.ObsDim.Last, ::Bool) at /home/carlo/.julia/packages/MLDataPattern/mX21p/src/dataview.jl:197
 [2] BatchView at /home/carlo/.julia/packages/MLDataPattern/mX21p/src/dataview.jl:359 [inlined] (repeats 2 times)
 [3] #DataLoader#25(::Int64, ::Bool, ::Bool, ::typeof(DataLoader), ::Array{Int64,2}) at ./REPL[62]:6
 [4] (::var"#kw##DataLoader")(::NamedTuple{(:batchsize, :partial, :shuffle),Tuple{Int64,Bool,Bool}}, ::typeof(DataLoader), ::Array{Int64,2}) at ./none:0
 [5] top-level scope at REPL[65]:1

What I've seen up to now is that the PyTorch DataLoader experience is smoother than anything else we have here.

@CarloLucibello
Member Author

This PR supersedes #675 and #450, and closes #317, #278, #220.

Old issue for integration with MLDataUtils.jl: #251.

@oxinabox
Member

I think a new repo is reasonable (I am a fan of the name MLDataUtils2, following the tradition of JSON2).
MLDataUtils works fine but does show its age, given it was developed for Julia 0.5.
Main age issues:

  • Broken up into many subpackages, because package manager/loading used to be really slow
  • A lot of use of types for things, because there was no constant folding
  • A lot of effort to make using kwargs not required, because they were not specialized on
  • Doesn't support Tables.jl (because that would not be created for several more years)
  • Excessive use of @inferred in tests, because that seemed like a good idea at the time but now I think we can be more chill

Now these could be fixed in the current repo.
But sometimes it is nice to start fresh. (and sometimes it isn't).

@CarloLucibello
Member Author

It seems you clearly outlined the current design problems of MLDataUtils, so yeah, let's start afresh! I don't like MLDataUtils2; maybe MLData or MLDataTools would be better, but I'm not strongly opposed either.

@GaiBrutman

In addition, I think data loading should support multithreading.

@CarloLucibello
Member Author

OK, so I'm going to merge this and update the model-zoo with examples more similar to real-world scenarios. As soon as we have something cooked up in MLDataUtils2, we can remove the DataLoader here.

@CarloLucibello
Member Author

bors r+

bors bot added a commit that referenced this pull request Feb 29, 2020
1051: add DataLoader r=CarloLucibello a=CarloLucibello

@bors
Contributor

bors bot commented Feb 29, 2020

Build failed

@CarloLucibello
Member Author

CUDA out-of-memory error, let's try again

bors r+

@bors
Contributor

bors bot commented Feb 29, 2020

Build succeeded

@bors bors bot merged commit 77a7606 into FluxML:master Feb 29, 2020
bors bot added a commit that referenced this pull request Mar 2, 2020
1057: add Julia ecosystem doc section r=CarloLucibello a=CarloLucibello

Partially fixing #251, related to the discussion in #1051.

Not exactly a poem that I wrote here, maybe someone could suggest a better rephrasing.
Suggestions for additional packages to add to the list are also welcome.

Co-authored-by: CarloLucibello <carlo.lucibello@gmail.com>
@MikeInnes
Member

It's not good that DataLoader(X) returns batches of X but DataLoader(X, Y) returns tuples of batches. The special case in train! is not great either, since it's supposed to be simple and obvious.

There are two options here: either have DataLoader(X) return a tuple, or only allow one input but use DataLoader((X, Y)) to return tuples. Then one can write DataLoader((Xs...,)) without ambiguity about what the type will be.

The former might be a good option insofar as it makes it easy to remove the train! special case. It'd also be easy to tell people to write train!(..., zip(loader), ...) to wrap the iterator output in a tuple.
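
(For what it's worth, the zip trick works because zip over a single iterator yields 1-tuples; a sketch using this PR's DataLoader:)

loader = DataLoader(rand(10, 100), batchsize = 2)  # yields plain 10×2 batches
for (x,) in zip(loader)
    # zip rewraps each batch as a 1-tuple, matching what train! splats
end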

if d.shuffle && i == 0
    shuffle!(d.indices)
end
nexti = min(i + d.batchsize, d.nobs)
Contributor

I think there is an issue in this line:
nexti = min(i + d.batchsize, d.nobs)
When (nexti - (i+1)) < d.batchsize, the batch returned will not have size (..., batchsize) but (..., < batchsize).

I have just implemented another simple version, which pads the last batch with the very first samples whenever its size is less than batchsize.

In my implementation, the user should give both x and y as input, because I think we should make this pairing explicit as a reminder to the user when constructing the loader. Please correct me if there is something wrong in my thinking.
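
(A sketch of the padding idea described above; padbatch is a hypothetical helper, not part of the PR:)

# top up a short tail of indices with indices from the start of the data
function padbatch(idxs::AbstractVector{Int}, batchsize::Int)
    short = batchsize - length(idxs)
    short <= 0 ? idxs : vcat(idxs, 1:short)
end

padbatch(collect(9:10), 4)  # returns [9, 10, 1, 2]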

Member Author

The implementation seems correct, see i >= d.imax && return nothing a few lines above. We do want to return a partial batch when partial == true. If you think it's wrong, produce an example of the wrong behavior.

Contributor
@wenjie-p Mar 18, 2020

When I run the following code

julia> x = rand(10,5); y = rand(5); loader = DataLoader(x, y, batchsize = 2)
julia> for i in 1:100
           for (xx, yy) in loader
               @assert size(xx) == (10, 2)
               @assert size(yy) == (2,)
           end
       end
ERROR: AssertionError: size(xx) == (10, 2)
Stacktrace:
 [1] top-level scope at ./REPL[31]:3

Testing the situation where size(xx) == (..., < batchsize):

julia> for i in 1:100
           for (xx, yy) in loader
               if size(xx) != (10, 2)
                   @assert size(xx) == (10, 1)
               end
           end
       end

julia> 

When I append a line like:

nexti = min(i + d.batchsize, d.nobs)
(nexti - i) == d.batchsize || return nothing

it works fine. Sorry for not providing an example before.

Member Author

Two things: 1) please use GitHub code blocks and try to make your code examples more readable; 2) read DataLoader's docstring more carefully, including what partial=true means.

Contributor
@wenjie-p Mar 18, 2020

Oh, my bad! The example has been reformatted. I have read the docstring, and to the best of my knowledge: when partial == true and the last minibatch is smaller than batchsize, it can still return a minibatch whose size is actually less than batchsize.

I think we could 1) add documentation that warns the user about this situation; and/or 2) add some assertion or similar to make sure every size(minibatch) == batchsize.

When I append a line like:

nexti = min(i + d.batchsize, d.nobs)
(nexti - i) == d.batchsize || return nothing

it works fine. Sorry for not providing an example before.

I think doing this would ignore the effect of partial, and it is not a good solution here.

Member Author

I really don't see the problem here; this is a feature, not a bug: if partial=true, return the last batch even if its size is less than batchsize; if partial=false, drop the last batch when it is not of the correct size. In either case, batches before the last one will be of size batchsize.
(BTW, you can use a ```julia code fence to get Julia syntax highlighting.)
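
To illustrate the two behaviors on 5 observations with this PR's DataLoader:

X = rand(10, 5)
[size(x, 2) for x in DataLoader(X, batchsize = 2)]                  # [2, 2, 1]: partial last batch kept
[size(x, 2) for x in DataLoader(X, batchsize = 2, partial = false)] # [2, 2]: short batch dropped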

Member Author

If the problem is with partial=true being the default, it's because that's the standard practice in deep learning (e.g. see PyTorch's DataLoader).

Contributor
@wenjie-p Mar 18, 2020

Thank you for your kind note. Yes, it is not a bug but a feature.

I had just noticed the example usage

for epoch in 1:100
    for (x, y) in train_loader
        @assert size(x) == (10, 2)
        @assert size(y) == (2,)
        ...
    end
end

where the @assert seems to indicate that every minibatch returned must have size batchsize. I have not dived deep into PyTorch's DataLoader yet. Really sorry for that!

@lorenzoh
Member

@MikeInnes

This is an initial implementation to get things going, possibly in the future we will also want a distributed and out-of-memory option

Maybe it's too late to chime in, but in a separate repository I've built a DataLoader that is more flexible than the one from Metalhead (i.e. it works with any data container) and, most importantly, uses multithreading to load data in parallel.
It's not stable yet, but I have used it to train large vision models with heavy data processing pipelines and it's been able to keep the GPU fully utilized.

https://github.com/lorenzoh/DataLoaders.jl
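
(A sketch of what using it could look like; the DataLoader((X, Y), batchsize) call is my recollection of the DataLoaders.jl README, so treat the exact API as an assumption:)

using DataLoaders

X, Y = rand(10, 100), rand(100)
for (x, y) in DataLoader((X, Y), 16)
    # batches are prepared on background threads while the loop body runs
end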

@DhairyaLGandhi
Member

Perhaps a good idea to benchmark it; I would love to see some optimised data loaders.

@lorenzoh
Member

Sure!

Though the speedup is about what you would expect, i.e. if you have 8 threads it's almost 8 times as fast, plus the added bonus that the data loading happens asynchronously, so the main training loop isn't interrupted -> no GPU starvation.

@CarloLucibello
Member Author

@lorenzoh what do you think about the MLDataUtils/LearnBase dependency? Would you consider adding it to this repo? I see a few options here, and for all of them dropping the MLDataUtils dependency could be considered:

  1. Add the @lorenzoh's DataLoader to this repo
  2. Reexport his package from this repo
  3. Deprecate our current DataLoader and use https://github.com/lorenzoh/DataLoaders.jl in docs and model-zoo

@lorenzoh
Member

Right! MLDataUtils.jl is not a necessary dependency, I just have it in there because it implements getobs for some common types like arrays and tuples of arrays.

If it makes sense to merge it into Flux.jl, sure.
Though I need to reiterate that it needs more testing, some cleaning up, and I'm not sure the error handling is correct yet.

If there is interest in an improved DataLoader, I'd be happy to work on those things; pull requests are also very welcome :)

@cloudhan

I don't think these types of utilities should go into the main repo. I recently worked with the dataloader and dataset APIs in MXNet, and they are efficient but inextensible. Think about it: why is your data a set, not a stream? PyTorch, MXNet and TensorFlow have the wrong data-loading abstraction, in that they assume their users' data is a finite set that can be easily enumerated, but it is not. To efficiently load your data from different file stores or databases, you need specialized code; to efficiently transform your data, you need to exploit parallelism and pipelining; to efficiently feed data to your lovely models, you need another layer that eagerly stages batches in host pinned memory and eagerly fetches them to the target device. These cannot fit into one paradigm easily. Now, the only library I think has the correct abstraction is tensorpack's dataflow package.

I'd suggest maintaining a separate utility repo that includes toy dataset downloaders and loaders. Otherwise, the main repo will eventually grow into a monster like TensorFlow or PyTorch.
