Start an idea of what an "in memory" requirement would look like #278
This has been requested/discussed a number of times; here's one idea of what this could look like. Basically the same as `Tables.rows`, but `Tables.indexablerows` would require an "indexable" object of rows to be returned instead of just an iterator. Indexable is a little vague; to be most useful, we should probably require the return object to be `AbstractVector` since we get lots of fancy indexing/useful behavior that way. The bare minimum indexing interface is just `getindex`, `firstindex`, and `lastindex`, but it seems like people would then just be wanting to do `x[[i, j, k]]` like operations and have to implement their own. So I'm inclined to make the requirement that you have to return an `AbstractVector` of rows.
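To illustrate the trade-off described above, here is a minimal sketch, assuming a hypothetical row container named `LazyRows` (not part of Tables.jl or this PR). Subtyping `AbstractVector` and defining only `size` and scalar `getindex` is enough for Base's generic fallbacks to supply `x[[i, j, k]]`, `end`, and iteration.

```julia
# Hypothetical row container; `LazyRows` is an illustrative name only.
struct LazyRows{T} <: AbstractVector{T}
    data::Vector{T}
end

# Minimal AbstractVector interface: size plus scalar getindex.
Base.size(r::LazyRows) = size(r.data)
Base.getindex(r::LazyRows, i::Int) = r.data[i]
Base.IndexStyle(::Type{<:LazyRows}) = IndexLinear()

rowvec = LazyRows([(a = 1, b = "x"), (a = 2, b = "y"), (a = 3, b = "z")])

second = rowvec[2]        # scalar indexing, defined above
picked = rowvec[[1, 3]]   # vector indexing via Base's AbstractArray fallbacks
last_a = rowvec[end].a    # `end` works because lastindex is derived from size
```

A type that only defines `getindex`, `firstindex`, and `lastindex` without subtyping `AbstractVector` would have to implement the vector-index method itself.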
Couple of additional thoughts:
Collecting a rowtable from an iterator can be a costly operation and I'd want to know when I'm doing it, so I'd rather get an error. |
As commented on Discourse I think we should require What I mean is that if you subtype I would also require However, if @juliohm and @ablaom will find it acceptable to just require CC @nalimilan |
Thanks for working on this @quinnj! I agree with @bkamins that it doesn't seem necessary/useful to require returned objects to subtype
Given that currently the |
Thank you all for putting this proposal together. I would like to double check with you if I am understanding things correctly. Consider the following abstract https://github.com/JuliaGeometry/Meshes.jl/blob/master/src/traits/data.jl This abstract type implements the Tables.jl interface and the main idea there is the following: if a user of the ecosystem has data that can be mapped to a geospatial domain like a mesh, image, geographic map, he/she can simply inherit from the abstract On the other hand, when I call |
At least this was my original proposal. Your table does not have to be |
@nalimilan, @bkamins, the problem IMO is that I think it's going to be too hard/complex to expect users to correctly implement all the possible indexing operations. Maybe I'm wrong here and it's not actually too hard, but from a brief glance, they'd have to implement |
I guess part of my hesitation there is the "incompleteness" of the full indexing interface; I understand the "basic" indexing interface, but that only allows single |
# fallback for indexablerows if not overloaded explicitly
function indexablerows(x::T) where {T}
    y = rows(x)
    if y isa AbstractArray
Why `AbstractArray` here, not `AbstractVector`?
That was my point above. This is doable, but it is super hard. The users would need to additionally handle things like: In my experience if someone would implement a non- In summary - while we do not have to expect BTW - we should implement a proper method for |
Yes, clearly it's simpler for table implementations to subtype
@bkamins Actually I think that if we require an Anyway, since it seems that @juliohm would actually be OK with having |
This was my original proposal and I am convinced by it if we are sure we are OK with requiring In summary I will re-state my question from Discourse:
My 2cents when having a distinction could make sense (but I am not even sure if currently such table type exists so maybe it is only a hypothetical scenario). Assume you have a table type that can serve rows via an iterator very cheaply. It potentially also can serve rows via indexable object but with some more overhead (but less than |
Could we add a |
There are already Also note that what you ask for is already defined as a part of the iteration protocol. If an iterator does not have a defined size it should return From my perspective the point of this discussion (and this was reflected in my original proposal on Discourse) is not to define new concepts/types/functions for things that are already covered in the data ecosystem, because if we do so it becomes increasingly difficult to:
What I would propose is to have the following priority:
I think it would be great to add packages from MLJ.jl into this list (to be explicit at what level of priority they fall). @ablaom - would it be only MLUtils.jl or are there also additional interface packages we should add? The point is that if e.g. we want to change something in Tables.jl the maintainers' responsibility is to make sure we do not duplicate something that is already defined in Base Julia or DataAPI.jl or StatsAPI.jl (this is exactly the case of When I said "we would need another function" I meant that even having
is an iterator, it has length (10), it has shape, but it does not (and cannot) support random access. |
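The example originally given here is not preserved in this copy of the thread. As an assumed stand-in, a plain generator over `1:10` is one object with exactly these properties: it reports a shape and a length, yet has no `getindex` method.

```julia
# Assumed illustration: a generator over a range of 10 elements.
g = (2i for i in 1:10)

itsize = Base.IteratorSize(typeof(g))    # shape trait inherited from the range
n = length(g)                            # defined, because the range has a length
dims = size(g)                           # defined, because the range has a shape
indexable = applicable(getindex, g, 1)   # no getindex method for generators
```

So knowing that an iterator has a length or shape says nothing about whether it supports random access.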
AFAICT we don't need to require the result of |
@bkamins Yes, I agree the suggestion is a poor one after all.
I do think it would be good to add MLUtils.jl to the mix. (It is not part of MLJ, however, and I have little connection to it.) DataLoaders.jl, a popular tool for handling out-of-memory datasets in the deep learning community, is based on the MLUtils.jl data container API (previously in LearnBase.jl). I'm thinking that if we can get MLUtils to play nicely with tables then MLJ could carry out observation resampling through that interface, rather than through a mixture of the Tables and AbstractArray interfaces, as it does currently. This would be more natural, allow us to jettison the MLJ row-indexing methods for tables, and be a way for some MLJ models to support training on data that does not fit into RAM.
A test suite (living in a dedicated package) could help with this. (This is just a general point, somewhat orthogonal to this discussion though).
It all depends on what it would be used for. If the users would expect to use it as vectors, then it is the right choice IMO. |
@quinnj + @nalimilan I have chatted with @ablaom about MLJ.jl requirements for Tables.jl interface. Here is a summary of what I think is crucial. @quinnj + @nalimilan => what do you think would be the best approach to support it (given the discussions we already had + maybe some additional new ideas are needed?). Thank you! Preliminary
Use case 1
A typical case in ML is doing Cross Validation
Use case 2
Accessing rows of a table in a random order (e.g. when processing data in randomly selected batches).
"Use case 1" is a special case of "use case 2", but maybe it could support a set of more efficient methods.
Thanks for the detailed summary. So it looks like the interface defined here would satisfy these use cases. What's missing is that the fallback implementation should be changed a bit and the docstring be more explicit about some guarantees:
I think it is enough that the same "mode" of the table is returned, i.e. column or row oriented. In particular maybe we should also explicitly document that @ablaom - can you please confirm that what @nalimilan described + my comment would fulfill your needs on high-level (then we should finalize the design in detail and submit it to another check with you). |
Thanks @bkamins and @nalimilan . Yes, I think what is described would fit my own needs.
I agree that returning a table of the same type doesn't need to be a hard requirement. For a start, this seems inconsistent with @bkamins sensible suggestion that views are preferred, and views generally don't have the same type as the original table.
This is probably a question for the designers. But I'm curious what precisely |
Well, no, you'd call
Here's one. :-) https://github.com/JuliaData/Tables.jl/pull/278/files#diff-12abce6071269cefb726c033c3832f2e0326e3a2fe769e67323a1b5ac0772483R111 |
Perhaps there is a misunderstanding. I mean, if I'm in use case 2 and I want fast row indexing, then I call
Ha ha. You are just pointing to an error thrown by the current proposal. I mean, if instead of throwing an error, the fallback just calls I'm just thinking aloud here and of course happy for you to settle this detail. |
I see. But If we want to support different performance variants, maybe that should be done via an argument to
My point was that calling |
Sorry, I don't understand:
julia> eachrow(df) === Tables.rows(df)
true
Maybe you mean But, yes, an option to get the fastest row-indexable object would be nice. I couldn't say if it is necessary without doing a lot of benchmarking. Is this the sort of thing you'd like to see before we proceed?
I just said "not faster", because |
As a side comment: In general |
Could we get something like this merged before the start of GSoC projects? I know some students are interested in building a panel data interface, and I think that gets significantly easier if you can do basic Split/Apply/Combine methods on Tables. |
Ok, jumping back into things here; sorry for the delay; personal life has been a bit crazy lately.
Hmmm, I didn't come to this conclusion reading @bkamins summary here. To me, it sounds like it's just desirable to have a This is yet another reason why requiring the result of That said, I don't feel strongly about requiring the subtyping, so I'm fine leaving that door open for now, though strongly recommending it and noting the minimum required methods to be implemented (which would be more than the currently defined Indexable interface in Base, which only requires single-int getindex). The final point I'm still considering is whether we throw an error by default, or call
What should be avoided, is repeatedly calling If that all sounds good, I'll clean this PR up: add some tests, lots of doc details, and we can merge in the next few days. |
IIRC I derived this part from the sentence "Here it should be possible to get a CV fold that should preserve the access type (column/row) of the original table." But I'm not sure why MLJ needs this. At least it seems that the exact type could be different as long as orientation is preserved. |
What I think is needed is if you select multiple rows is that efficient access style (rows or columns) does not change. I think that requiring type not to change would be too strict. The point is that if e.g. the source table had fast column access then after
(and the same for rows) |
Hmmm, this seems a little paradoxical; i.e. you're calling |
@ablaom - can you confirm? To my understanding the requirement is that if the original object allowed you to do a fast extraction of a column of a table then the subsetted object should retain this property (it does not have to be the same type). |
Yes, this is what I have been looking for, to address use-case 1. However, I am realising a little late that the "fast extraction of columns" property is not the same thing as "columnaccess(X) == true" (and similarly for "rowaccess"). And my confusion extends to my conceptualisation of the ML workflows which are motivating my request. 😢 . Over the next couple of days I will rethink this and hopefully clarify our needs better. |
Thanks to all for your continued patience, and for continuing to check in with my requirements. After reflecting on my confusion, I should like to tweak my requirements for Use Case 1. When I subsample the rows of A. Here is more on the rationale for requirement A. When a data scientist chooses a table type, she chooses it for a reason. That is, the table type reflects some desired characteristics such as:
Very often you don't want observation subsampling - a very basic operation in ML - to break those characteristics. That is, preserving table type takes precedence over cost of the operation (which, compared to training models, is not usually an issue). Note that in MLJ, cross-validation is automated. So, at present there is no interface for "correcting" the type of a table whose model-specific requirements are being broken by row subsampling. In TableTransforms.jl (which is more about columns than rows) the design choice is to always return the output using Generally views of a table change the type (and hence properties of the table we might want to preserve). However, I think it is useful to expose them and suggest we add an option to specify B. ( I might be wrong, but it seems to me that the current PR is not able to address Use Case 1, because it does not sample rows directly but creates a new representation for which row subsampling has certain properties. So I see this PR as about meeting Use Case 2 (and Use Case 1 is not just a special case of Use Case 2). What is wrong with adding a As I say, the current PR seems more about Use Case 2. For that I'm not sure I understand the need for extension of the Tables API. As @bkamins has suggested earlier, I can test if Number of rows. In both use cases, one needs to know the number of rows before executing the subsampling. Can we have some way for the user to determine table "finiteness"? There seem to be three cases:
Couple comments:
Based on additional comments, I'm wondering if we want a different kind of interface altogether here. Let me think out loud, stream-of-consciousness style, and see if it ends up making sense. The "problem" w/ the current PR is that in the case of
function subset(x, I::Vector{Int})
    if columnaccess(x)
        return ColumnSubset(x, I)
    else
        return x[I]
    end
end
So for rows, we'd fallback to For column sources, we'd have a helper object defined like:
struct ColumnSubset{X}
    x::X
    inds::Vector{Int}
end
Tables.istable(::Type{<:ColumnSubset}) = true
Tables.columnaccess(::Type{<:ColumnSubset}) = true
Tables.columnnames(x::ColumnSubset) = columnnames(x.x)
Tables.getcolumn(x::ColumnSubset, i) = getcolumn(x.x, i)[x.inds]
So here we see that for any column oriented source, Let me know what you all think and if I've missed anything obvious (probably have, it's a bit late here for me and as I mentioned, this was a live, coding "writeup" as I thought through this).
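For readers who want to try the `ColumnSubset` idea outside the Tables.jl module, a standalone sketch might look like the following. It assumes the Tables.jl package is installed; the explicit `Tables.columns` method and the separate `Int`/`Symbol` `getcolumn` methods are additions made here for illustration and are not taken from the comment above.

```julia
using Tables

# Lazy row subset of a column-oriented source (mirrors the sketch above).
struct ColumnSubset{X}
    x::X                # column-accessible source
    inds::Vector{Int}   # selected row indices
end

Tables.istable(::Type{<:ColumnSubset}) = true
Tables.columnaccess(::Type{<:ColumnSubset}) = true
Tables.columns(s::ColumnSubset) = s   # assumption: the subset serves as its own columns object
Tables.columnnames(s::ColumnSubset) = Tables.columnnames(s.x)
# Row indexing is applied per column, only when that column is requested.
Tables.getcolumn(s::ColumnSubset, i::Int) = Tables.getcolumn(s.x, i)[s.inds]
Tables.getcolumn(s::ColumnSubset, nm::Symbol) = Tables.getcolumn(s.x, nm)[s.inds]

# A NamedTuple of vectors is a valid column table in Tables.jl.
tbl = (a = [10, 20, 30, 40], b = ["w", "x", "y", "z"])
sub = ColumnSubset(Tables.columns(tbl), [4, 2])

col_a = Tables.getcolumn(sub, :a)   # rows 4 and 2 of column a
col_b = Tables.getcolumn(sub, 2)    # same rows of the second column
```

Nothing is copied when `sub` is constructed; each `getcolumn` call pays for indexing one column.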
Thanks @quinnj
As I detail in my previous comment, I care more about preserving the properties of the original table than about efficiency of the subsetting operation (in Use Case 1). My performance bottlenecks are downstream of the subsetting. If others view my requirements as too niche, then a table with the same access type is preferred. Ideally there would be an option to ensure this access (column or row) is fast (with a copy if necessary).
Yes, my bad 🙄 |
I feel that:
If "the same type as original" requirement is understood in a somewhat relaxed way, then this is already fulfilled by lots of in-memory Maybe, other
Ok, so correct me if I'm wrong, but it sounds like my latest proposal hits the right points that have been brought up; namely, the plan is:
So a source could overload to provide the same type in the subset operation, but the generic fallback would at least preserve access type. My one qualm/question is what to do about the 2nd argument for I guess I'm mainly wondering if there are enough variations there that, rather than making the API complex for implementations, we should just keep it simple for now. @bkamins, what do you think on this point?
You have hit an excellent spot here as I have spent much time thinking about it when designing DataFrames.jl 😄. This is what I think:
Regarding |
Just to clarify, these names come from MLUtils.jl (originally LearnBase.jl) which presently has nothing at all to do with MLJ. (MLJ does have a method called
Mimicking the Base terminology sounds like a good idea, for the reasons you state. Lot easier for me to remember too. |
@bkamins' proposal of a |
I think |
It's a fair point (NamedArrays come to mind) but I personally think this is okay because the proposal is to implement |
Maybe it would be better to use explicit names rather than mimicking the Base terminology with Note that we already have |
After some discussion with @nalimilan I propose the following:
@quinnj - would you have time to have a look at this PR and finalize it (we can then follow it up in DataFrames.jl 1.4 release that we plan to have soon to have packages in sync). Thank you! |
Can we just start by defining the function |