Redesign package to be built on top of reusable dataset containers #96
Codecov Report

```diff
@@           Coverage Diff            @@
##           master      #96    +/-   ##
==========================================
+ Coverage   31.10%   36.02%   +4.91%
==========================================
  Files          34       38       +4
  Lines         807      880      +73
==========================================
+ Hits          251      317      +66
- Misses        556      563       +7
```
Given JuliaML/MLUtils.jl#56, should we avoid depending on MLUtils and just define
We should drop support for Julia < 1.6.
If the MLUtils.jl dependency is too expensive, I'd rather move ahead with splitting the interface into its own base package. My view is that MLDatasets should be a first-class consumer of the MLUtils interface, and MLUtils may well be a cheaper dependency than some of the IO-related packages. Though this is a good question to consider: should there be a "just the data deps" package that splits out only the download/unpack functions?
Force-pushed from e654cd3 to b7ba9c4.
I decided against incorporating text in this PR, since it involves a temporal dimension that we haven't yet discussed as part of the MLUtils.jl refactor. I also expect that the best option will involve factoring out text datasets, similar to what we discussed for vision. It would also be useful to get @lorenzoh's eyes on this PR.
```julia
    source::T
    cacheidx::Vector{Int}
    cache::S
end
```
Should this go in MLUtils.jl?
Think so
I can make a PR, though this is only useful for data that isn't already in memory. I had trouble thinking of cases where that's true but the data isn't a dataset.
`src/containers/cacheddataset.jl` (outdated diff):

```julia
function CachedDataset(source, cachesize::Int = numobs(source))
    cacheidx = 1:cachesize
```
Storing the first `cachesize` entries seems totally arbitrary.
Any subset of the indices would be arbitrary. The main reason for including this is to address FFCV from this issue. My lab mate is using FFCV with PyTorch on ImageNet right now, and we strongly suspect that the overwhelming majority of the performance gains come from caching a portion of the dataset in memory (my lab mate will request 400 GB on our cluster). In this case, the particular indices matter less than how many are cached, so the first N seems as good as any.
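Putting the struct fields and constructor fragments above together, a minimal sketch of how such a `CachedDataset` could work is below. This is an illustration based on the discussion, not the PR's final implementation; the `getobs`/`numobs` extension points are from MLUtils.jl, everything else is assumed.

```julia
using MLUtils: getobs, numobs

struct CachedDataset{T, S}
    source::T
    cacheidx::Vector{Int}
    cache::S
end

function CachedDataset(source, cachesize::Int = numobs(source))
    # Eagerly materialize the first `cachesize` observations in memory.
    cacheidx = collect(1:cachesize)
    cache = [getobs(source, i) for i in cacheidx]
    return CachedDataset(source, cacheidx, cache)
end

# Serve cached observations from memory; fall back to the (possibly
# disk-backed) source for everything else.
function MLUtils.getobs(data::CachedDataset, i::Int)
    j = findfirst(==(i), data.cacheidx)
    return j === nothing ? getobs(data.source, i) : data.cache[j]
end

MLUtils.numobs(data::CachedDataset) = numobs(data.source)
```

With contiguous indices `1:cachesize`, the `findfirst` lookup could be replaced by a simple bounds check, which matches the observation above that *which* indices are cached matters less than how many.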
Left some comments, mostly on the uglier bits I put into FastAI.jl.
Besides those things, I think we should standardize on the type we use to represent file paths, i.e. `String` or `AbstractPath`. Even if we want to allow passing both, I think it makes sense to use the same representation internally.

In favor of `String`s:

- a bit simpler
- the globbing only returns files as `String`s, and I think converting these to `Path`s consumes more memory

In favor of `AbstractPath`s:

- a more thoughtful API
- more extensible (e.g. other types of file systems like S3)
- less ambiguous

What do you think?
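For context on the trade-off above, a small illustration of the conversion cost being discussed: Glob.jl returns plain `String`s, which would then need to be wrapped to obtain path objects (the `"src"` directory and `"*.jl"` pattern here are made up for illustration).

```julia
using Glob, FilePathsBase

# Glob.jl matches against the filesystem and returns Vector{String}.
files = glob("*.jl", "src")

# Using AbstractPath internally means wrapping each entry,
# one extra allocation per file.
paths = Path.(files)
```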
`src/containers/filedataset.jl` (outdated diff):

```julia
Load a file from disk into the appropriate format.
"""
function loadfile(file::String)
```
Not sure if a bunch of nested `if`s is the most elegant solution for this. Since this was meant to be a high-level function for FastAI.jl, maybe we can do without it for now?
Should I just go with `FileIO.load` then?
Yeah, I think that's the best for now. But let's re-export `load` instead of creating a new `loadfile` function.
I went with
@darsnack Do the wrapped
Not at this moment, but based on our discussion it's clear that it should! I'll add this tomorrow.
If not already, I suggest the wrapper be parameterised by the type of the table being wrapped. See the continuation of the same discussion.
Yeah, for exactly the reasons in the discussion, we already do this. And we have faster paths for when the underlying table is a
I see. That works here, but it's not ideal if we were to move this to MLUtils (say), because we would need DataFrames and CSV as dependencies. Those
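The parameterisation being discussed can be sketched as follows. This is a hypothetical illustration of the design, not the PR's code; the fast path is shown as a comment because specialising on `DataFrame` would require DataFrames as a dependency, which is exactly the concern raised above.

```julia
using Tables

# Parameterising on the wrapped table's type lets methods specialise on it.
struct TableDataset{T}
    table::T

    function TableDataset(table::T) where {T}
        Tables.istable(table) ||
            throw(ArgumentError("expected a Tables.jl-compatible table"))
        new{T}(table)
    end
end

# Generic (universal but slow) row access through the Tables.jl interface.
getobs(data::TableDataset, i::Int) =
    first(Iterators.drop(Tables.rows(data.table), i - 1))

# A fast path could specialise on the type parameter, e.g.
#     getobs(data::TableDataset{DataFrame}, i::Int) = data.table[i, :]
# kept behind a conditional dependency (Requires.jl or a package extension)
# so MLUtils would not need DataFrames itself.
```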
Co-authored-by: lorenzoh <lorenz.ohly@gmail.com>
CI fails due to a missing
Okay, this is probably good to go now.
You can remove
I just added a WIP page so that the docstrings are at least there, should someone choose to use the dev branch. I also changed the behavior to only check docstrings for exported functions (since I added
This works towards #73 by adding low-level dataset containers that are MLUtils.jl compatible out of the box. The end goal is to rewrite the existing datasets to use these containers, but this PR will just add the containers without changing the existing functionality that users expect. In a followup PR, I will rewrite the existing API, provide a deprecation path, and add better documentation (docstrings will still be added here first).
So far, I have ported only the containers from FastAI.jl. I intend to implement the following before merging this PR:
Low level

- `FileDataset`: lazy access to a vector of files on disk
- `TableDataset`: wraps any Tables.jl compatible table
- `HDF5Dataset`: built on HDF5.jl
- `JLD2Dataset`: built on JLD2.jl
- ~~`MATDataset`: built on MAT.jl~~ the data can just be read in as a `Dict` and used with MLUtils.jl without any extra effort (there is no value to having a wrapper here)

Mid level

- `CachedDataset`: wraps a lazy dataset like `FileDataset` so that it stays in memory (this is how small datasets like MNIST will be)
- ~~`GroupedDataset`: wraps several named containers together (e.g. for "train" + "val" data)~~ MLUtils already has functionality for this