
Redesign package to be built on top of reusable dataset containers #96

Merged: 19 commits, Mar 5, 2022

Conversation

@darsnack (Member) commented Feb 19, 2022

This works towards #73 by adding low-level dataset containers that are MLUtils.jl compatible out of the box. The end goal is to rewrite the existing datasets to use these containers, but this PR will just add the containers without changing the existing functionality that users expect. In a followup PR, I will rewrite the existing API, provide a deprecation path, and add better documentation (docstrings will still be added here first).

So far, I have ported only the containers from FastAI.jl. I intend to implement the following before merging this PR:

Low level

  • FileDataset: lazy access to a vector of files on disk
  • TableDataset: any Tables.jl-compatible table
  • Text: I still need to decide how best to reuse JuliaText, similar to how TableDataset wraps Tables.jl
  • HDF5Dataset: built on HDF5.jl
  • JLD2Dataset: built on JLD2.jl
  • MATDataset: built on MAT.jl; the data can just be read in as a Dict and used with MLUtils.jl without any extra effort (there is no value in having a wrapper here)

Mid level

  • CachedDataset: wraps a lazy dataset like FileDataset so that it stays in memory (this is how small datasets like MNIST will be handled)
  • GroupedDataset: wraps several named containers together (e.g. for "train" + "val" data); MLUtils already has functionality for this
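As a rough usage sketch of how the low- and mid-level layers might compose (constructor signatures and the path here are illustrative assumptions, not the final API):

```julia
# Hypothetical composition of the containers listed above; exact
# constructors may differ in the merged code.
using MLDatasets: FileDataset, CachedDataset
using MLUtils: getobs, numobs

files = FileDataset("data/images")   # lazy: each observation is read on access
x1 = getobs(files, 1)                # loads the first file from disk

cached = CachedDataset(files)        # mid-level: keeps observations in memory
numobs(cached) == numobs(files)      # same container interface either way
```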

@codecov-commenter commented Feb 19, 2022

Codecov Report

Merging #96 (2ff0d29) into master (f45ef65) will increase coverage by 4.91%.
The diff coverage is 93.67%.


@@            Coverage Diff             @@
##           master      #96      +/-   ##
==========================================
+ Coverage   31.10%   36.02%   +4.91%     
==========================================
  Files          34       38       +4     
  Lines         807      880      +73     
==========================================
+ Hits          251      317      +66     
- Misses        556      563       +7     
Impacted Files Coverage Δ
src/MLDatasets.jl 43.75% <ø> (ø)
src/containers/filedataset.jl 80.00% <80.00%> (ø)
src/containers/tabledataset.jl 89.28% <89.28%> (ø)
src/containers/cacheddataset.jl 100.00% <100.00%> (ø)
src/containers/hdf5dataset.jl 100.00% <100.00%> (ø)
src/containers/jld2dataset.jl 100.00% <100.00%> (ø)
src/CIFAR10/CIFAR10.jl 50.00% <0.00%> (-10.00%) ⬇️
src/MNIST/MNIST.jl 66.66% <0.00%> (-8.34%) ⬇️
src/SVHN2/SVHN2.jl 66.66% <0.00%> (-8.34%) ⬇️
src/CIFAR100/CIFAR100.jl 66.66% <0.00%> (-8.34%) ⬇️
... and 4 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f45ef65...2ff0d29.

@CarloLucibello (Member)

Given JuliaML/MLUtils.jl#56, should we avoid depending on MLUtils and just define length and getindex?

@CarloLucibello (Member)

We should drop julia < 1.6

@darsnack (Member, PR author)

The way I wrote AbstractDataContainer makes the reverse true as well: anything that supports getobs/numobs automatically gets getindex, length, and iterate.

If the MLUtils.jl dependency is too expensive, I'd rather move ahead with splitting the interface into its own base package. My view is that MLDatasets should be a first-class consumer of the MLUtils interface. I also think that MLUtils might be a cheaper dep than some of the other IO related packages. Though this is a good question to consider: should there be a "just the data deps" package that splits out only the download/unpack functions?
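A minimal self-contained sketch of that two-way bridge (the stand-in getobs/numobs functions here stand in for the MLUtils.jl ones; the actual definitions in the PR may differ):

```julia
# Any type implementing getobs/numobs gets indexing, length, and
# iteration for free via the abstract supertype.
abstract type AbstractDataContainer end

function getobs end
function numobs end

Base.getindex(d::AbstractDataContainer, i) = getobs(d, i)
Base.length(d::AbstractDataContainer) = numobs(d)
Base.iterate(d::AbstractDataContainer, i = 1) =
    i > numobs(d) ? nothing : (getobs(d, i), i + 1)

# toy container implementing only the observation interface
struct Squares <: AbstractDataContainer end
getobs(::Squares, i::Int) = i^2
numobs(::Squares) = 4

collect(Squares())  # the four squares 1, 4, 9, 16
```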

@darsnack (Member, PR author) commented Feb 22, 2022

I decided against incorporating text in this PR since it involves a temporal dimension that we haven't yet discussed as part of the MLUtils.jl refactor. I also expect that the best option will involve factoring out text datasets similar to how we discussed for vision.

Would also be useful to get @lorenzoh's eyes on this PR.

source::T
cacheidx::Vector{Int}
cache::S
end
Member: Should this go in MLUtils.jl?

Contributor: Think so

Member Author: I can make a PR, though this is only useful for data that isn't already in memory. I had trouble thinking of cases where that's true but the data isn't a dataset.

src/containers/cacheddataset.jl: three resolved review threads
end

function CachedDataset(source, cachesize::Int = numobs(source))
cacheidx = 1:cachesize
Member: Storing the first cachesize entries seems totally arbitrary.

Member Author: Any subset of the indices would be arbitrary. The main reason for including this is to address FFCV from this issue. My lab mate is using FFCV with PyTorch on ImageNet right now. We strongly suspect that the overwhelming majority of the performance gains come from caching a portion of the dataset in memory (my lab mate will request 400GB on our cluster). In this case, the particular indices don't matter so much as how many, so the first N seems as good as any.
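A self-contained sketch of that first-N caching strategy (getobs/numobs here are local stand-ins for the MLUtils.jl functions; the real implementation may differ):

```julia
# Stand-ins for MLUtils.getobs/numobs so the sketch runs on its own.
getobs(v::AbstractVector, i::Int) = v[i]
numobs(v::AbstractVector) = length(v)

struct CachedDataset{T,S}
    source::T
    cacheidx::Vector{Int}
    cache::S
end

# Eagerly materialize the first `cachesize` observations.
function CachedDataset(source, cachesize::Int = numobs(source))
    cacheidx = collect(1:cachesize)
    CachedDataset(source, cacheidx, [getobs(source, i) for i in cacheidx])
end

# Cached observations are served from memory, the rest from the source.
getobs(d::CachedDataset, i::Int) =
    i <= length(d.cacheidx) ? d.cache[i] : getobs(d.source, i)
```

With `d = CachedDataset(collect(10:10:50), 3)`, indices 1 through 3 come from the in-memory cache and 4 through 5 fall back to the source.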

src/containers/filedataset.jl: two resolved review threads
@lorenzoh (Contributor) left a review:

Left some comments, mostly on the uglier bits I put into FastAI.jl.

Besides those things, I think we should standardize on the type we use to represent file paths, i.e. String or AbstractPath. Even if we want to allow passing both, I think it makes sense to use the same representation internally.

In favor of Strings:

  • bit simpler
  • the globbing only returns files, and I think converting these to Paths consumes more memory

In favor of AbstractPaths:

  • more thoughtful API
  • more extensible (e.g. other types of file systems like S3)
  • less ambiguous

What do you think?
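One possible shape for that boundary, shown as a sketch of the trade-off (this is not the PR's actual code; the PR ultimately went with AbstractString): accept both kinds of values but normalize to String internally.

```julia
# Accept any vector of path-like values at the boundary, but store a
# normalized Vector{String} internally.
struct FileDataset
    paths::Vector{String}
end

# String.(...) normalizes Strings and SubStrings; AbstractPath inputs
# could be handled the same way with an extra `string.` conversion
# (an assumption about FilePathsBase, not shown here).
FileDataset(paths::AbstractVector) = FileDataset(String.(paths))
```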



Load a file from disk into the appropriate format.
"""
function loadfile(file::String)
Contributor: Not sure if a bunch of nested ifs is the most elegant solution for this. Since this was meant to be a high-level function for FastAI.jl, maybe we can do without it for now?

Member Author: Should I just go with FileIO.load then?

Contributor: Yeah, I think that's the best for now. But let's re-export load instead of creating a new loadfile function.
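A sketch of that re-export suggestion (the default-loader wiring in the comment is a hypothetical, not the PR's code):

```julia
# Re-export FileIO.load instead of defining a new loadfile function.
using FileIO: load
export load

# FileDataset could then default its loader to `load`, e.g.
# (hypothetical signature):
# FileDataset(dir; loadfn = load) = ...
```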

@darsnack (Member, PR author)

I went with AbstractString because it doesn't seem like FilePathsBase.jl is used frequently enough in the ecosystem.

@ablaom commented Feb 24, 2022

@darsnack Do the wrapped TableDataset objects themselves implement the Tables.jl interface? That is, is every TableDataset(some_table) also a table?

@darsnack (Member, PR author)

Not at this moment, but based on our discussion it's clear that it should! I'll add this tomorrow.
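Such forwarding might look like the following sketch (the `table` field name and the column-access choice are assumptions, not the merged code):

```julia
# Forward the Tables.jl interface through the wrapper so that
# TableDataset(some_table) is itself a table.
using Tables

Tables.istable(::Type{<:TableDataset}) = true
Tables.columnaccess(::Type{<:TableDataset}) = true
Tables.columns(d::TableDataset) = Tables.columns(d.table)
Tables.schema(d::TableDataset) = Tables.schema(Tables.columns(d.table))
```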

@ablaom commented Feb 24, 2022

If not already, I suggest the wrapper be parameterised by the type of the table being wrapped. See continuation of same discussion.

@darsnack (Member, PR author)

Yeah, for exactly the reasons in the discussion, we already do this. And we have faster paths for when the underlying table is a DataFrame or CSV.File.
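A sketch of that type-parameterised wrapper with specialized fast paths (field name and exact method bodies are assumptions about the PR's code):

```julia
# Parameterising on the table type lets dispatch pick a fast path.
struct TableDataset{T}
    table::T
end

# generic fallback via the Tables.jl interface
getobs(d::TableDataset, i::Int) = Tables.subset(d.table, i)

# faster paths when the wrapped table supports direct row indexing
getobs(d::TableDataset{DataFrames.DataFrame}, i::Int) = d.table[i, :]
getobs(d::TableDataset{CSV.File}, i::Int) = d.table[i]
```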

@ablaom commented Feb 24, 2022

I see. That works here, but it's not ideal if we were to move this to MLUtils (say), because we would need DataFrames and CSV as dependencies. Those getobs implementations should ideally live in those packages (as the Tables interface does), right?

Co-authored-by: lorenzoh <lorenz.ohly@gmail.com>
@CarloLucibello (Member)

CI fails due to a missing using FileIO, I guess. Besides that, this looks good to me.

@darsnack (Member, PR author) commented Mar 5, 2022

Okay this is probably good to go now. TableDataset is now a Tables.jl table. We can continue the discussion around tables in MLUtils.jl and update here as needed. But for now, this will allow us to move forward in this repo with the refactor.

@CarloLucibello (Member)

You can remove strict = true from docs/make.jl if you don't want to add this stuff to the documentation right now.

@darsnack (Member, PR author) commented Mar 5, 2022

I just added a WIP page so that the docstrings are at least there should someone choose to use the dev branch. I also changed the behavior to only check docstrings for exported functions (since I added rglob which we probably don't want in the actual docs).
