
redesign the package #73

Closed · CarloLucibello opened this issue Jul 28, 2021 · 7 comments

@CarloLucibello commented Jul 28, 2021

I'd like to open a discussion on how we should move forward with implementing a getobs- and nobs-compliant API, while possibly also simplifying the interface and reducing the maintenance burden.

See FluxML/FastAI.jl#22

I think we should move away from the module-based approach and adopt a type-based one. It could also be convenient to have some lean type hierarchy.

Below is an initial proposal

AbstractDatasets

####### src/datasets.jl

abstract type AbstractDataset end
abstract type FileDataset <: AbstractDataset end
abstract type InMemoryDataset <: AbstractDataset end

MNIST Dataset

######  src/vision/mnist.jl
"""
docstring here, also exposing the internal fields of the struct for transparency
"""
struct MNIST <: InMemoryDataset
    x            # alternative names: `features` or `inputs`
    targets      # alternative names: `labels` or `y`
    num_classes  # optional

    function MNIST(path=nothing; split = :train)  # split could be made a mandatory keyword arg
        @assert split in [:train, :test]
        ..........
    end
end

LearnBase.getobs(data::MNIST) = (data.x, data.targets)
LearnBase.getobs(data::MNIST, idx) = (data.x[:, idx], data.targets[idx])
LearnBase.nobs(data::MNIST) = length(data.targets)

.... other stuff ....

Usage

using MLDatasets: MNIST
using Flux
train_data = MNIST(split = :train)
test_data = MNIST(split = :test)

xtrain, ytrain = getobs(train_data)
xtrain, ytrain = train_data # we can add this for convenience

xs, ys = getobs(train_data, 1:10)
xs, ys = train_data[1:10] # we can add this for convenience


train_loader = DataLoader(train_data; batch_size=128)

Transforms

Do we need transformations as part of the datasets?
This is a possible interface that assumes the transform operates on whatever is returned by getobs:

getobs(data::MNIST, idx) = data.transform(data.x[:, idx], data.targets[idx])

MNIST(split = :train, transform = (x, y) -> (random_crop(x), y))

Deprecation Path 1

We can create a deprecation path for the code

using MLDatasets: MNIST
xtrain, ytrain = MNIST.traindata(...)

by implementing

function Base.getproperty(data::MNIST, s::Symbol)
  if s == :traindata
    @warn "deprecated method"
    return ....

  ....
end
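
Note that `MNIST.traindata(...)` is a property access on the type MNIST rather than on an instance, so the shim would actually need to overload getproperty for `Type{MNIST}`. A minimal sketch, assuming the new struct takes the `split` keyword proposed above (the suggested replacement in the warning is illustrative):

# Sketch: intercept property access on the *type* so the old call
# `MNIST.traindata()` keeps working during the deprecation window.
function Base.getproperty(::Type{MNIST}, s::Symbol)
    if s === :traindata
        Base.depwarn("`MNIST.traindata` is deprecated, use `getobs(MNIST(split=:train))` instead", :traindata)
        # return a callable so the old call syntax `MNIST.traindata()` still works
        return (args...) -> getobs(MNIST(; split = :train))
    end
    return getfield(MNIST, s)  # fall back to the default behavior
end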

Deprecation Path 2

The pattern

using MLDatasets.MNIST: traindata
xtrain, ytrain = traindata(...)

instead is more problematic, because it assumes a module MNIST exists, and this (deprecated) module would collide with the struct MNIST. A workaround is to call the new struct MNISTDataset, although I'm not super happy with such a long name.

cc @johnnychen94 @darsnack @lorenzoh

@darsnack commented Jul 28, 2021

Abstract datasets

Here, I was thinking of something slightly different. If you look at FastAI.jl's containers, you see that FileDataset, TableDataset, etc. are generic containers for data stored in a certain format. So, instead of these being abstract types, they would be concrete structs that implement loading from disk, HDF5, CSV, etc. This provides a low-level set of containers that anyone can use to load the data they have without needing to implement a custom getobs/nobs.
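
A minimal sketch of what such a concrete container could look like; the `loadfn` field and the constructor here are assumptions, not a settled API:

# Sketch of a concrete file-backed container: one file per observation,
# loaded lazily on access by a user-supplied function.
struct FileDataset{F}
    paths::Vector{String}  # one file per observation
    loadfn::F              # e.g. FileIO.load to read an image from disk
end

FileDataset(dir::String, loadfn = identity) =
    FileDataset(readdir(dir; join = true), loadfn)

LearnBase.getobs(data::FileDataset, i::Integer) = data.loadfn(data.paths[i])
LearnBase.nobs(data::FileDataset) = length(data.paths)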

MNIST datasets

These datasets would now just wrap the low-level containers. You could envision something like: the images are stored in directories and the labels are in a CSV. MNIST would then wrap a FileDataset and a TableDataset, with getobs(data::MNIST, i) = (getobs(data.images, i), getobs(data.labels, i)).
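
As a rough sketch of that composition (the field names and the wrapped container types are illustrative):

# Sketch of MNIST as a thin wrapper over the generic containers;
# the directory/CSV layout assumed here is hypothetical.
struct MNIST{I,L}
    images::I  # e.g. a FileDataset over the image files
    labels::L  # e.g. a TableDataset over a labels CSV
end

LearnBase.getobs(data::MNIST, i) = (getobs(data.images, i), getobs(data.labels, i))
LearnBase.nobs(data::MNIST) = nobs(data.images)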

Transforms

I think these can go in MLDataPattern.jl unless they are very specifically disk IO related.

Deprecations

This is definitely going to be tricky. My preference right now is to support as many paths as possible but allow a few hard breakages. Or if we want to be really conservative, then we can do MNISTDataset for now and deprecate that name on the following release. But I think that will be too confusing. Though MNISTDataset isn't the worst name either.

@johnnychen94 commented Jul 31, 2021

I see the goal here is to 1) reduce code duplication and, most importantly, 2) introduce a Dataset concept and maybe some predefined types (e.g., ImageFolder), like PyTorch does.

Here are some of my very brief thoughts:

  1. Concrete datasets, e.g., MNIST, should be instances of some dataset type, instead of subtypes.
  2. We should consider online datasets, e.g., stream models (via Iterators) and remote storage models (via lazy arrays).
  3. Datasets with transformations are just lazy arrays (see the sketch after this list).
  4. Datasets are another level of abstraction on MLDataPattern's iterator model and array model. For simplicity, I prefer to hide getobs and other MLDataPattern details from the end users; the dataset type should support either the Julia Array interface or the Iterator interface.
  5. About deprecation, we can unconditionally move all code to src/legacy/v0, and unconditionally deprecate all current symbols by pointing them to MLDatasets.legacy.v0. For example, MLDatasets.MNIST.traindata becomes MLDatasets.legacy.v0.MNIST.traindata. (Of course, we could maybe save ourselves some effort by starting a new project.)
  6. DataDeps is a good tool for offline datasets, but I want to take advantage of the more configurable Data.toml from Datasets.jl (another dataset...)
  7. About stream models, we can try to support one online dataset platform API as an example. (Any suggestions?)
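
A minimal sketch of point 3, using MappedArrays.jl as one possible mechanism; the array and the transform here are just placeholders:

# A transformed dataset as a lazy array: the transform is applied on
# access, with no copy of the underlying data.
using MappedArrays: mappedarray

xs = rand(Float32, 28, 28, 60_000)                      # stand-in for raw images
normalized = mappedarray(x -> (x - 0.5f0) / 0.5f0, xs)  # lazy, zero-copy view
normalized[1, 1, 1]                                     # transform runs here, on access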

I'm working on a draft version for more detailed discussion; in the meantime, I just created a jc/v1 branch as a staging branch.

@CarloLucibello commented Mar 21, 2022

Coming here from #98, since there are some design discussions to have.
Let's focus on the Iris dataset, which one can see as a "table" dataset, although we currently don't treat it as one.

On Master

Iris is a module and the interface we have is

X = Iris.features()  # a matrix
y = Iris.labels()    # a vector

Column names are missing but they are easy to obtain.

On PR#98

Now Iris is a type <: AbstractDataset that internally stores the X matrix and the y vector.

data = Iris()

X, y = data.features, data.targets  # a matrix and a vector
X, y = getobs(data)                 # alternative syntax
X, y = data[]                       # alternative syntax

X, y = data[1:10]                   # a matrix and a vector
X, y = getobs(data, 1:10)           # alternative syntax

Suggestion#1

Per @darsnack's suggestion in #98 (review), we should compose the datasets with the recently added containers from #96, so Iris would wrap a TableContainer, which in turn would wrap a DataFrame.

A few comments on this approach:

  • One possible advantage is that one would avoid writing the read-from-disk code and use TableContainer for that instead, but I doubt we can provide something better than what CSV/DataFrames already give.
  • I prefer simple and transparent data structures to unnecessarily complicated ones. What do we gain from all these layers of composition (data.container.df)?

Also, in this approach we obtain dataframe rows when indexing:

rows = data[1:10]   # a SubDataFrame

Maybe we can tweak it into giving

Xrows, yrows = data[1:10]   # 2 SubDataFrames

Is this what we want? I think we can serve two audiences: those doing data wrangling with dataframes, and those who just want to work with arrays. Maybe the first can just grab the df, while we reserve the indexing/getobs syntax for the latter? This doesn't need to be true for every table dataset; we could also decide case by case. Returning dataframes when indexing would also be fine though, and is actually more consistent.

Suggestion#2

Regarding having too much composition, can we find something in between? Like making TableDataset an abstract type or a collection of methods, and having Iris store the underlying dataframe directly, as data.df.
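
A rough sketch of that middle ground, assuming the getobs/nobs methods are generic enough for any DataFrame-backed dataset to share:

# Sketch of Suggestion#2: "table dataset" as a set of reusable methods
# rather than a wrapper struct; Iris holds the DataFrame directly.
using DataFrames

struct Iris <: AbstractDataset
    df::DataFrame
end

# generic table methods that any DataFrame-backed dataset could share
getobs(data::Iris, idx) = data.df[idx, :]
nobs(data::Iris) = nrow(data.df)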

@darsnack commented Apr 2, 2022

Looking over this, I agree that we don't want data.container.df in the end. There is really little reason not to just use the Matrix or Vector directly.

My reason for wanting to use TableDataset is so that we don't need to rewrite the file access and loading-from-disk code. Admittedly, this is simple for us (i.e. maintainers) in the case of Iris. But the point of this issue is to redesign the package so that we have an API that we can robustly use for our own cases, and that beginners could use to load new, custom datasets.

I added CachedDataset so that lazy file-based containers like TableDataset or FileDataset can handle the reading and loading while one interacts with a Matrix, etc. in memory. CachedDataset allows the cache to be smaller than the dataset, giving partial in-memory access. What if we add a fast path so that CachedDataset(..., 1:numobs(...)) returns the underlying cached Matrix etc. directly? This would basically remove all the wrappers when possible.
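
One way to express that fast path, sketched as a hypothetical helper (`maybe_cached` is not an existing function):

# Hypothetical helper implementing the fast path: if the requested cache
# covers the whole dataset, materialize it and drop the wrapper entirely.
using MLUtils: getobs, numobs

maybe_cached(data, idx) =
    idx == 1:numobs(data) ? getobs(data, idx) : CachedDataset(data, idx)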

I see two things that need to happen then for the Iris case (assuming we use this approach):

  1. TableDataset needs to be able to group columns together. So Iris becomes a (TableDataset, TableDataset) where the first one selects the feature columns and the second selects the label column.
  2. CachedDataset has the fast path as described above.

Then we have Iris(...) = CachedDataset((TableDataset(...), TableDataset(...))), which returns a Tuple{Matrix{...}, Vector{...}} in the end.

@CarloLucibello commented Apr 3, 2022

That seems a bit opaque; I think newcomers would be more confused than helped. Also, reading a CSV requires a lot of customization (headers, comments, delimiters, types), so I think it is just easier to interface directly with CSV.jl for that. Anyway, these are implementation details: if we manage to make adding new datasets easy, flexible, and consistent, that's good, but for now I will be happy enough if we converge on a user-facing interface for in-memory tabular datasets.

I looked into sklearn's datasets for inspiration, e.g. see here.
They choose between dataframe and array representations at construction time with the as_frame keyword:

from sklearn import datasets

# Datasets are dictionary-like objects of type `Bunch` whose keys can be accessed as fields.
# as_frame=True is the default for most datasets.
d = datasets.fetch_openml("iris", as_frame=True)
df_X = d.data     # a DataFrame with input features
df_y = d.target   # a DataFrame with targets (actually a pd.Series when there is a single target)
df_Xy = d.frame   # a DataFrame with both features and targets

# directly return features and targets dataframes
df_X, df_y = datasets.fetch_openml("iris", as_frame=True, return_X_y=True)

# store arrays instead of dataframes
d = datasets.fetch_openml("iris", as_frame=False)
X = d.data     # numpy array
y = d.target   # numpy array

# directly return features and targets arrays
X, y = datasets.fetch_openml("iris", as_frame=False, return_X_y=True)

I think we can have something very similar:

d = Iris(as_frame = true)
df_X = d.features   # dataframe
df_y = d.targets    # dataframe
df = d.df           # dataframe
df_X, df_y = d[]    # return the underlying dataframes
df_X, df_y = d[i]   # two rows or two sub-dataframes

d = Iris(as_frame = false)
X = d.features      # array
y = d.targets       # array
X, y = d[]          # return the underlying arrays
X, y = d[i]

@darsnack commented Apr 3, 2022

> Anyway, these are implementation details: if we manage to make adding new datasets easy, flexible, and consistent, that's good, but for now I will be happy enough if we converge on a user-facing interface for in-memory tabular datasets.

Sure, if that's the goal in #98, then I prefer we go with what we already have in that PR: the datasets load directly to arrays. We can consider the discussion here about allowing dataframes/tables/etc. as a feature for later.


The rest here is just a continuation of the discussion, not concerns that block #98. I think we have two issues here.

First is that a CSV file can be interpreted in two ways: as a table dataset or as an array dataset. If we consider CSV.jl, it doesn't allow an Array as a sink, and loading into a DataFrame then converting the underlying data to a Matrix is a common pattern that I've used for data like Iris.
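
For reference, that pattern is roughly the following; the file name and column layout are just illustrative for Iris:

using CSV, DataFrames

df = CSV.read("iris.csv", DataFrame)  # CSV.jl requires a table sink
X = Matrix(df[:, 1:4])                # then convert the feature columns to an array
y = Vector(df[:, 5])                  # and the label column to a vector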

Second is being able to reuse code for in-memory and lazy data loading. I agree that the CachedDataset approach above seems too magical, and therefore, confusing.

Maybe the least confusing thing is to avoid guessing or death by keywords and to use a more declarative syntax.

  1. TableDataset(<csv file>) always returns a table (same enforcement as CSV.jl). Like you mentioned, if someone needs special handling for headers, etc., they can just use TableDataset(CSV.File(...)) to use CSV.jl directly.
  2. Going from "multiple columns -> elements of a vector," like Iris does for each row, is a transform applied to the table. We can do MLUtils.mapobs(GroupCols(Array, ...), table_container).
  3. Steps 1-2 give us a clear interface for loading data, but it is still lazy and involves lots of wrappers. I realized we don't need any magic with CachedDataset to get the full in-memory dataset: MLUtils already provides a function that does this, getobs(data). Calling getobs on the output of Steps 1-2 will give the full table as a Matrix (see the sketch after this list).
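
Putting the three steps together, roughly; GroupCols is the hypothetical transform from Step 2, while mapobs and getobs are existing MLUtils functions:

using MLUtils

table = TableDataset("iris.csv")                      # Step 1: lazy table container
obs   = MLUtils.mapobs(GroupCols(Array, ...), table)  # Step 2: group columns per observation
X     = getobs(obs)                                   # Step 3: materialize fully in memory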

It still seems complicated though. With JuliaML/MLUtils.jl#61, this might only be required for tables that need to be lazy (i.e. too large for memory).

@CarloLucibello CarloLucibello added this to the v0.7 milestone Apr 6, 2022
@CarloLucibello commented
the redesign has been largely done
