Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can MLUtils play nicely with Tables.jl? #61

Closed
ablaom opened this issue Feb 23, 2022 · 26 comments · Fixed by #124
Closed

Can MLUtils play nicely with Tables.jl? #61

ablaom opened this issue Feb 23, 2022 · 26 comments · Fixed by #124

Comments

@ablaom
Copy link

ablaom commented Feb 23, 2022

I think one could get greatly increase buy-in for MLUtil.jl if every Tables.jl compatible table would automatically implement the "data container" API. To get performance, one would still want to implement the concrete table types as well, but having it "just work" for all tables would be nice. I guess, since "table" is itself just an interface, rather than an abstract type, this would need to be implemented as part of the data container API, right? As Tables.jl is very lightweight, I don't see that as a big issue (and I could probably find someone to help with the integration).

Even so, there seems to be a problem implementing the interface for certain tables. MLUtils.jl interprets tuples in a very specific way. For example shuffleobs((x1, x2)) treats x1 and x2 as separate data containers, which are to be shuffled simultaneously, with the same base observation index shuffle. But some tables are tuples. The following example is even a tuple-table whose elements are themselves tables (of a different type):

julia> X
((a = [1, 3], b = [2, 3]), (a = [2, 5], b = [4, 7]))

julia> Tables.istable(X)
true

So is such a tuple a pair of data containers or a single data container? The current API cannot distinguish them.

I wonder:

  1. How attached are people to current tuple-based dispatch for coupled multi-container processing?
  2. Is there a big use-case for tables that are also tuples? @quinnj

Possibly this discussion is related.

Tables that are tuples are problematic elsewhere.

@oxinabox @rikhuijzer @darsnack

@ablaom
Copy link
Author

ablaom commented Feb 23, 2022

@OkonSamuel

@darsnack
Copy link
Member

JuliaML/MLDatasets.jl#96 adds data containers that are MLUtils compatible including TableDataset which wraps any Tables.jl compatible table. Tables.jl is just an interface, so we need a wrapper type to dispatch correctly. This way even a tuple that's intended to be a table can be treated as such.

Perhaps TableDataset is more fundamental and should be moved to MLUtils directly.

@ablaom
Copy link
Author

ablaom commented Feb 23, 2022

@darsnack Thanks! That's good to know.

Perhaps TableDataset is more fundamental and should be moved to MLUtils directly.

Makes sense to me.

@ablaom
Copy link
Author

ablaom commented Feb 24, 2022

What are the downsides of using trait-based dispatch to avoid the wrapping, apart from meaning Tables.jl has to be a dependency of BaseLearn.jl (or whatever replaces that)? That is, assuming the tuples issue is resolved another way.

@darsnack
Copy link
Member

You mean using a traits package that turns istable into a dispatch instead of a run time check?

@ablaom
Copy link
Author

ablaom commented Feb 24, 2022

Mmm, on further reflection, I guess there's no way to do this without breaking the API for current implementations, right?

@darsnack
Copy link
Member

Depends on what you're thinking. With SimpleTraits.jl that's not the case though I haven't checked whether @traitimpl IsTable{X} <- istable(X) will incur no overhead.

Breaking changes to the API is not an issue, but I'd like to avoid needing to call istable on everything by default (also a bit arbitrary to special case Tables.jl). And I think having tuples/named tuples follow the current behavior is more desirable. A table might be a named tuple but not all named tuples are tables. I think for a Base type, it makes more sense for the user to specify a special meaning.

@darsnack
Copy link
Member

I guess the other way of thinking about this is what is an example where something is Tables.jl compatible, but wrapping it in something like TableDataset is not desirable/possible? For TableDataset, we can make all the standard Tables.jl operations pass through to the underlying table so that the wrapper is nearly invisible beyond construction.

@ablaom
Copy link
Author

ablaom commented Feb 24, 2022

I think these are fair arguments. And, by the way, I do think the TableDataset work is a cool and valuable contribution (which should probably not be hidden away in MLDatasets).

However, I believe the run-of-the mill statistics user will find wrapping tables off-putting, to say the least.
Therefore, in the "tables-must-be-wrapped" proposal, it becomes the responsibility of every model providing package or ML tooling package to wrap and unwrap tables, yes? But then such a package needs to test whether or not to do so (unless it only deals with tables) essentially duplicating the trait-based dispatch which I am proposing goes into this base ML API interface in the first place, no?

@ToucheSir
Copy link
Contributor

I'm not sure why shuffleobs isn't a varargs function (historical reasons?), but the big issue with auto-wrapping Tables.jl-compatible sources is this. In an absence of an efficient random access API to rely on, our best option is to rely on users implementing getobs on their own dataset types. Always wrapping would mean consistently poor performance.

@ablaom
Copy link
Author

ablaom commented Feb 24, 2022

Good point I hadn't thought of. By always wrapping, you hide the original table type, for which there might be efficient random access for which you could otherwise specialise on . Instead you always have to go with the random access through the Tables.jl API which is generally bad.

@darsnack
Copy link
Member

Therefore, in the "tables-must-be-wrapped" proposal, it becomes the responsibility of every model providing package or ML tooling package to wrap and unwrap tables, yes? But then such a package needs to test whether or not to do so (unless it only deals with tables) essentially duplicating the trait-based dispatch which I am proposing goes into this base ML API interface in the first place, no?

Yeah that's a good point. I will see if there's an overhead to a trait-based approach and file a PR if not. Note that tuples would still be excluded from this path.

@ablaom
Copy link
Author

ablaom commented Feb 24, 2022

I will see if there's an overhead to a trait-based approach and file a PR if not.

Thanks indeed for consdering this. Will this addresses @ToucheSir 's important point? That is, would getobs(my_table) always dispatch the inefficient tables fallback, even if an efficient getobs(::MyTable) is implemented? Perhaps we also we need a isdatacontainer trait (analogous to istable) to resolve this issue. Is there something like this?

I still feel the current interpretation of tuples in MLUtils, as way to dispatch special behaviour (and inherited from MLDataPattern) is intrinsically flawed. This isn't just about tables. Why should we exclude implementation of the interface to objects that happen to be tuples (or any other data type)? What about @ToucheSir 's suggestion to use varargs, which "only" breaks use of shuffle and its cousins in the multi-container case, as far as I can tell. Are there other reasons for not using varargs?

@darsnack
Copy link
Member

That is, would getobs(my_table) always dispatch the inefficient tables fallback, even if an efficient getobs(::MyTable) is implemented?

I will double check, but if getobs(::MyTable) exists, then it would take precedence over getobs(::X) where {X, IsTable{X}}. There is still the question of where these fast implementations live. Hopefully, DataFrames, etc. will be happy to accept a PR adding these definitions.

Why should we exclude implementation of the interface to objects that happen to be tuples (or any other data type)?

Unfortunately, Julia forces a one-or-none situation here. Let's say MLUtils had no special path for tuples, but we did have a trait-based path for Tables.jl tables. Then tuples will automatically take that path. If some other Foo.jl interface that's not Tables.jl is also compatible with tuples, then there is no way for tuples to selectively participate as a Tables.jl table or a Foo.jl object.

My point is that treating tuples as tables is the same as treating them the way we do now. The only difference is what interpretation is given favoritism. So the alternative to what we have now is that tuples have no meaning attached to them, or we decide which interpretation is the "golden" one.

as way to dispatch special behaviour

At face value, it looks like the tuple decision is about dispatching the inputs to a function, but it is really a consequence of the outputs. In Julia, there is a single, canonical way to return multiple values, and that is a tuple.

If I do sample = getobs(A, i) for some array A, then I can apply the whole MLUtils pipeline (shuffleobs, etc.) on sample. This is a feature of the package—that I can just write out the pipeline interleaving data processing, transformation, and iteration utilities. For a container that returns multiple parts in each sample, sample will be a tuple (e.g. HDF5 datasets where various inputs and labels are stored under different HDF5 directories).

The alternative for MLUtils is to create some GroupedData which would be fine to use for data sources, but cumbersome to enforce at intermediate stages in the pipeline (every type should need to wrap every multi-return value as a GroupedData?). This would basically be MLUtils creating its own version of tuple.

So, the decision that a group of "things" is a tuple is really the interpretation given by Julia's return behavior. Maybe no package should take ownership of tuple in its interface, and MLUtils should do some GroupedData thing. But I really do think we are adhering quite closely to the interpretation implied by Julia. IMO, it's Tables.jl that is misusing the notion of a tuple by turning it into a table (of course, I lack experience...maybe there is a really good reason that I'm missing).

@darsnack
Copy link
Member

^long explanation above being said, I don't want to imply that I feel so strongly about this that MLUtils can't budge...the solution here is going to have to be a compromise somewhere and that can certainly be this package.

@darsnack
Copy link
Member

Another option is to enforce explicit splatting in the pipeline

@ablaom
Copy link
Author

ablaom commented Feb 27, 2022

So, the decision that a group of "things" is a tuple is really the interpretation given by Julia's return behavior.

Aha. Yes, I get this now. You want tuples for the pipelining.

Another option is to enforce explicit splatting in the pipeline

For the user, you mean? Sorry, I'm not actually that familiar with the pipelining here - is there special syntax or are you just composing functions?

^long explanation above being said, I don't want to imply that I feel so strongly about this that MLUtils can't budge...the solution here is going to have to be a compromise somewhere and that can certainly be this package.

Thanks for being open about this. I'm kind of playing devil's advocate here, and this more subtle than I realised at first. It may may be that wrapping tables that don't have native implementations of getobs is the safer path.

Mmm. What if we put logic in the wrapping operation that detects whether getobs is implemented, and if it is, we just return the original (unwrapped) object. This addresses @ToucheSir 's concern, right?

@darsnack
Copy link
Member

For the user, you mean? Sorry, I'm not actually that familiar with the pipelining here - is there special syntax or are you just composing functions?

Yeah we just use |> so this would mean asking the user to turn a tuple into a grouped container to pass to the next stage in the pipeline.

Mmm. What if we put logic in the wrapping operation that detects whether getobs is implemented, and if it is, we just return the original (unwrapped) object.

You mean that if someone tries to wrap a DataFrame as a TableDataset then we just pass through the DataFrame instead of wrapping it (similarly for anything that implements getobs directly)? We can certainly make that happen, but this would still require the user or downstream package to perform the wrapping operation? Just trying to understand where auto-wrapping would come into play here.

@ToucheSir
Copy link
Contributor

FYI: JuliaData/Tables.jl#278

@CarloLucibello
Copy link
Member

CarloLucibello commented Apr 3, 2022

I think it would be good to support Tables.jl's tables without any wrappers. For numobs we can change the generic fallback to

function numobs(data)
    if Tables.istable(data)
        return length(Tables.rows(data))
    else
        return length(data)
    end
end

getobs is more problematic since Tables.jl doesn't expose a getrow(table, i) interface. We could just implement the following, that works for DataFrames

function getobs(data, idx::Union{Integer, AbstractVector{<:Integer}})
    if Tables.istable(data)
        return data[idx, :]
    else
        return data[idx]
    end
end

At least this leaves us in a better position than where we currently stand.

Tuples and NameTuples current specializations should stay as they are instead.

@ablaom
Copy link
Author

ablaom commented Apr 3, 2022

@CarloLucibello (cc edited) Thanks for that. I agree it would be good to avoid wrapping. Summarizing discussions at JuliaData/Tables.jl#276 and on slack, I think there will be no change at Tables.jl to rule out tuple-tables, and I suggest MLUtils.jl continues to treat tuples as multi-containers. On further investigation, ruling out tuple tables for ML is probably fine, at least for now.

I did not realize that MLUtils treats named tuples in a special way. Can you say more about this? That could be a deal-breaker as named tuples of equi-length vectors (called column tables) is a fundamental native Julia table type used all over the place.

As for the your getobs proposal. I think we can improve on this and I suggest we wait and see how JuliaData/Tables.jl#278 resolves. I think there is momentum for progress there. Is there some urgency at your end? If so, I could have a bash at suggesting something better.

@darsnack
Copy link
Member

darsnack commented Apr 3, 2022

It's treated the same as tuple: a multi-container like (features = ..., targets = ...) will call getobs, etc. on each entry in the named tuple as if it were a container.

@ablaom
Copy link
Author

ablaom commented Apr 4, 2022

Oh no 😭 .

@ToucheSir
Copy link
Contributor

AFAICT MLUtils and Tables jl are consistent in treating a namedtuple of vectors or tuples as column tables, can you elaborate on your concern?

@ablaom
Copy link
Author

ablaom commented Apr 4, 2022

Oh, I see. I guess that means column tables can only be indexed as if they were multi-containers of vectors (or tuples) (in the MLUtils sense). Nothing that Tables.jl does about column tables (eg, some change to the implementation of Tables.rows) will have any effect on behaviour, no matter what we implement here for generic tables, but I guess that doesn't matter too much??

@CarloLucibello
Copy link
Member

It will be possible to solve this issue using Tables.getrows from JuliaData/Tables.jl#284 and DataAPI.nrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants