
Redesign package to be built on top of reusable dataset containers #96

Merged: 19 commits, Mar 5, 2022

Conversation

@darsnack (Member) commented Feb 19, 2022

This works towards #73 by adding low-level dataset containers that are MLUtils.jl compatible out of the box. The end goal is to rewrite the existing datasets to use these containers, but this PR will just add the containers without changing the existing functionality that users expect. In a followup PR, I will rewrite the existing API, provide a deprecation path, and add better documentation (docstrings will still be added here first).

So far, I have ported only the containers from FastAI.jl. I intend to implement the following before merging this PR:

Low level

  • FileDataset: lazy access to a vector of files on disk
  • TableDataset: any Tables.jl-compatible table
  • Text: I still need to decide how best to reuse JuliaText, similar to how TableDataset wraps Tables.jl
  • HDF5Dataset: built on HDF5.jl
  • JLD2Dataset: built on JLD2.jl
  • MATDataset: built on MAT.jl; the data can just be read in as a Dict and used with MLUtils.jl without any extra effort (there is no value in having a wrapper here)

Mid level

  • CachedDataset: wraps a lazy dataset like FileDataset so that it stays in memory (this is how small datasets like MNIST will be handled)
  • GroupedDataset: wraps several named containers together (e.g. for "train" + "val" data); MLUtils already has functionality for this
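As a rough usage sketch of how the low- and mid-level layers might compose (constructor signatures and the path here are illustrative assumptions, not the final API):

```julia
# Hypothetical composition of the containers listed above; exact
# constructors may differ in the merged code.
using MLDatasets: FileDataset, CachedDataset
using MLUtils: getobs, numobs

files = FileDataset("data/images")   # lazy: each observation is read on access
x1 = getobs(files, 1)                # loads the first file from disk

cached = CachedDataset(files)        # mid-level: keeps observations in memory
numobs(cached) == numobs(files)      # same container interface either way
```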

@codecov-commenter commented Feb 19, 2022

Codecov Report

Merging #96 (2ff0d29) into master (f45ef65) will increase coverage by 4.91%.
The diff coverage is 93.67%.


@@            Coverage Diff             @@
##           master      #96      +/-   ##
==========================================
+ Coverage   31.10%   36.02%   +4.91%     
==========================================
  Files          34       38       +4     
  Lines         807      880      +73     
==========================================
+ Hits          251      317      +66     
- Misses        556      563       +7     
Impacted Files Coverage Δ
src/MLDatasets.jl 43.75% <ø> (ø)
src/containers/filedataset.jl 80.00% <80.00%> (ø)
src/containers/tabledataset.jl 89.28% <89.28%> (ø)
src/containers/cacheddataset.jl 100.00% <100.00%> (ø)
src/containers/hdf5dataset.jl 100.00% <100.00%> (ø)
src/containers/jld2dataset.jl 100.00% <100.00%> (ø)
src/CIFAR10/CIFAR10.jl 50.00% <0.00%> (-10.00%) ⬇️
src/MNIST/MNIST.jl 66.66% <0.00%> (-8.34%) ⬇️
src/SVHN2/SVHN2.jl 66.66% <0.00%> (-8.34%) ⬇️
src/CIFAR100/CIFAR100.jl 66.66% <0.00%> (-8.34%) ⬇️
... and 4 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f45ef65...2ff0d29.

@CarloLucibello (Member)

Given JuliaML/MLUtils.jl#56, should we avoid depending on MLUtils and just define length and getindex?

@CarloLucibello (Member)

We should drop julia < 1.6

@darsnack (Member, PR author)

The way I wrote AbstractDataContainer makes the reverse true as well: anything that supports getobs/numobs automatically gets getindex, length, and iterate.

If the MLUtils.jl dependency is too expensive, I'd rather move ahead with splitting the interface into its own base package. My view is that MLDatasets should be a first-class consumer of the MLUtils interface. I also think that MLUtils might be a cheaper dep than some of the other IO related packages. Though this is a good question to consider: should there be a "just the data deps" package that splits out only the download/unpack functions?
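A minimal self-contained sketch of that two-way bridge (the stand-in getobs/numobs functions here stand in for the MLUtils.jl ones; the actual definitions in the PR may differ):

```julia
# Any type implementing getobs/numobs gets indexing, length, and
# iteration for free via the abstract supertype.
abstract type AbstractDataContainer end

function getobs end
function numobs end

Base.getindex(d::AbstractDataContainer, i) = getobs(d, i)
Base.length(d::AbstractDataContainer) = numobs(d)
Base.iterate(d::AbstractDataContainer, i = 1) =
    i > numobs(d) ? nothing : (getobs(d, i), i + 1)

# toy container implementing only the observation interface
struct Squares <: AbstractDataContainer end
getobs(::Squares, i::Int) = i^2
numobs(::Squares) = 4

collect(Squares())  # the four squares 1, 4, 9, 16
```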

@darsnack (Member, PR author) commented Feb 22, 2022

I decided against incorporating text in this PR since it involves a temporal dimension that we haven't yet discussed as part of the MLUtils.jl refactor. I also expect that the best option will involve factoring out text datasets similar to how we discussed for vision.

Would also be useful to get @lorenzoh's eyes on this PR.

source::T
cacheidx::Vector{Int}
cache::S
end
Member: Should this go in MLUtils.jl?

Contributor: Think so

Member Author: I can make a PR, though this is only useful for data that isn't already in memory. I had trouble thinking of cases where that's true but the data isn't a dataset.

src/containers/cacheddataset.jl: three resolved review threads
end

function CachedDataset(source, cachesize::Int = numobs(source))
cacheidx = 1:cachesize
Member: Storing the first cachesize entries seems totally arbitrary.

Member Author: Any subset of the indices would be arbitrary. The main reason for including this is to address FFCV from this issue. My lab mate is using FFCV with PyTorch on ImageNet right now. We strongly suspect that the overwhelming majority of the performance gains come from caching a portion of the dataset in memory (my lab mate will request 400GB on our cluster). In this case, the particular indices don't matter so much as how many, so the first N seems as good as any.
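A self-contained sketch of that first-N caching strategy (getobs/numobs here are local stand-ins for the MLUtils.jl functions; the real implementation may differ):

```julia
# Stand-ins for MLUtils.getobs/numobs so the sketch runs on its own.
getobs(v::AbstractVector, i::Int) = v[i]
numobs(v::AbstractVector) = length(v)

struct CachedDataset{T,S}
    source::T
    cacheidx::Vector{Int}
    cache::S
end

# Eagerly materialize the first `cachesize` observations.
function CachedDataset(source, cachesize::Int = numobs(source))
    cacheidx = collect(1:cachesize)
    CachedDataset(source, cacheidx, [getobs(source, i) for i in cacheidx])
end

# Cached observations are served from memory, the rest from the source.
getobs(d::CachedDataset, i::Int) =
    i <= length(d.cacheidx) ? d.cache[i] : getobs(d.source, i)
```

With `d = CachedDataset(collect(10:10:50), 3)`, indices 1 through 3 come from the in-memory cache and 4 through 5 fall back to the source.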

src/containers/filedataset.jl: two resolved review threads
@lorenzoh (Contributor) left a review:

Left some comments, mostly on the uglier bits I put into FastAI.jl.

Besides those things, I think we should standardize on the type we use to represent file paths, i.e. String or AbstractPath. Even if we want to allow passing both, I think it makes sense to use the same representation internally.

In favor of Strings:

  • bit simpler
  • the globbing only returns files, and I think converting these to Paths consumes more memory

In favor of AbstractPaths:

  • more thoughtful API
  • more extensible (e.g. other types of file systems like S3)
  • less ambiguous

What do you think?
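One possible shape for that boundary, shown as a sketch of the trade-off (this is not the PR's actual code; the PR ultimately went with AbstractString): accept both kinds of values but normalize to String internally.

```julia
# Accept any vector of path-like values at the boundary, but store a
# normalized Vector{String} internally.
struct FileDataset
    paths::Vector{String}
end

# String.(...) normalizes Strings and SubStrings; AbstractPath inputs
# could be handled the same way with an extra `string.` conversion
# (an assumption about FilePathsBase, not shown here).
FileDataset(paths::AbstractVector) = FileDataset(String.(paths))
```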



Load a file from disk into the appropriate format.
"""
function loadfile(file::String)
Contributor: Not sure if a bunch of nested ifs is the most elegant solution for this. Since this was meant to be a high-level function for FastAI.jl, maybe we can do without it for now?

Member Author: Should I just go with FileIO.load then?

Contributor: Yeah, I think that's the best for now. But let's re-export load instead of creating a new loadfile function.
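A sketch of that re-export suggestion (the default-loader wiring in the comment is a hypothetical, not the PR's code):

```julia
# Re-export FileIO.load instead of defining a new loadfile function.
using FileIO: load
export load

# FileDataset could then default its loader to `load`, e.g.
# (hypothetical signature):
# FileDataset(dir; loadfn = load) = ...
```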

@darsnack (Member, PR author)

I went with AbstractString because it doesn't seem like FilePathsBase.jl is used frequently enough in the ecosystem.

@ablaom commented Feb 24, 2022

@darsnack Do the wrapped TableDataset objects themselves implement the Tables.jl interface? That is, is every TableDataset(some_table) also a table?

@darsnack (Member, PR author)

Not at this moment, but based on our discussion it's clear that it should! I'll add this tomorrow.
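Such forwarding might look like the following sketch (the `table` field name and the column-access choice are assumptions, not the merged code):

```julia
# Forward the Tables.jl interface through the wrapper so that
# TableDataset(some_table) is itself a table.
using Tables

Tables.istable(::Type{<:TableDataset}) = true
Tables.columnaccess(::Type{<:TableDataset}) = true
Tables.columns(d::TableDataset) = Tables.columns(d.table)
Tables.schema(d::TableDataset) = Tables.schema(Tables.columns(d.table))
```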

@ablaom commented Feb 24, 2022

If not already, I suggest the wrapper be parameterised by the type of the table being wrapped. See continuation of same discussion.

@darsnack (Member, PR author)

Yeah, for exactly the reasons in the discussion, we already do this. And we have faster paths for when the underlying table is a DataFrame or CSV.File.
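A sketch of that type-parameterised wrapper with specialized fast paths (field name and exact method bodies are assumptions about the PR's code):

```julia
# Parameterising on the table type lets dispatch pick a fast path.
struct TableDataset{T}
    table::T
end

# generic fallback via the Tables.jl interface
getobs(d::TableDataset, i::Int) = Tables.subset(d.table, i)

# faster paths when the wrapped table supports direct row indexing
getobs(d::TableDataset{DataFrames.DataFrame}, i::Int) = d.table[i, :]
getobs(d::TableDataset{CSV.File}, i::Int) = d.table[i]
```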

@ablaom commented Feb 24, 2022

I see. That works here, but it's not ideal if we were to move this to MLUtils (say), because we would need DataFrames and CSV as dependencies. Those getobs implementations should ideally live in those packages (as the Tables interface does), right?

Co-authored-by: lorenzoh <lorenz.ohly@gmail.com>
@CarloLucibello (Member)

CI fails due to a missing using FileIO, I guess. Besides that, this looks good to me.

@darsnack (Member, PR author) commented Mar 5, 2022

Okay this is probably good to go now. TableDataset is now a Tables.jl table. We can continue the discussion around tables in MLUtils.jl and update here as needed. But for now, this will allow us to move forward in this repo with the refactor.

@CarloLucibello (Member)

You can remove strict = true from docs/make.jl if you don't want to add this stuff to the documentation right now.

@darsnack (Member, PR author) commented Mar 5, 2022

I just added a WIP page so that the docstrings are at least there should someone choose to use the dev branch. I also changed the behavior to only check docstrings for exported functions (since I added rglob which we probably don't want in the actual docs).
