
Random access to rows for some in-memory sources #48

Closed · ablaom opened this issue Nov 30, 2018 · 5 comments

Comments

ablaom (Contributor) commented Nov 30, 2018

Feature request

I am not too familiar with Tables.jl, but I understand that if I am interacting with an in-memory data source (a DataFrame, say), then I do not have random access to the rows of my frame through the interface. Rather, I can request a row iterator. If I only want a selection of rows, then (using the Tables.jl interface) I must iterate through all the rows until I have fetched the ones I need, no?

Would it be possible for Tables.jl to provide a random-access row accessor (i.e., a getindex method) that is performant for those data sources that actually support random access (DataFrames, the dense and sparse JuliaDB formats, TypedTables, TimeSeries, etc.), with a slower fallback for out-of-memory or distributed sources?
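For illustration, a minimal sketch of the kind of interface meant here; the names `randomaccess` and `getrow` are hypothetical and not part of Tables.jl:

```julia
using Tables

# Hypothetical sketch only: neither `randomaccess` nor `getrow` exists in Tables.jl.

# Trait: does this source support fast random access to rows?
randomaccess(::Any) = false

# Slow generic fallback: iterate rows until the requested index is reached.
function getrow(table, i::Integer)
    for (k, row) in enumerate(Tables.rows(table))
        k == i && return row
    end
    throw(BoundsError(table, i))
end

# A source with indexable rows would override both, e.g. (hypothetically):
# randomaccess(::DataFrame) = true
# getrow(df::DataFrame, i::Integer) = df[i, :]
```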

Use case

I am involved in the development of MLJ.jl, a Julia machine learning toolbox for model tuning, composition, etc. Ideally we would like our toolbox to be somewhat data-container agnostic. The main use case would be data loaded into memory, in which case we want fast random row access to make tuning by cross-validation performant. We could limit ourselves to in-memory data sources and convert all such data (using IterableTables, say) into a DataFrame (say). It seems to me, however, that a superior solution would be to leave data sources alone and interact with them through Tables.jl, which would also allow us to interact with out-of-memory or distributed sources, albeit in a less than optimal way.

tpapp (Contributor) commented Nov 30, 2018

I am not sure an interop standard is the best place to do this. If both the source and destination support the Tables.jl interface, you can always read the data into a representation that supports random access (via some other API).
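For instance, a minimal sketch of that approach, assuming DataFrames.jl as the destination (any sink with indexable rows would do):

```julia
using DataFrames, Tables

# Any Tables.jl source can be materialized into a row-indexable representation.
src = (a = 1:3, b = ["x", "y", "z"])   # a NamedTuple of vectors is itself a table
df = DataFrame(src)                    # copies the data into a DataFrame
df[2, :]                               # random access to the second row
```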

davidanthoff (Collaborator) commented

TableTraits.jl provides that functionality as an optional interface. There is no source right now that implements it, but it would really be just a handful of lines of code to add this to a source. For some of the plans we have with Query.jl we also need this functionality.

@quinnj and I will try sometime later this year to see if we can somehow sort out the relationship between Tables.jl and TableTraits.jl, and if we pull that off I would assume that this would be part of that.

quinnj (Member) commented Feb 2, 2019

@ablaom, sorry for the slow response here; I'm trying to catch up on things after taking a bit of a break. I know we've talked a bit elsewhere (discourse, the MLJ repos), so I'm not sure how things have evolved since you originally opened this issue, but here are a few thoughts.

  • I'm adding first-class AbstractMatrix integration in this PR: Add MatrixTable type to wrap a Matrix and provide Tables.jl interface. #61. That allows seamless to/from conversion for any Tables.jl implementor.
  • We also have the materializer function to get back the same type that came in.
  • In general, the strategy of Tables.jl is for "users" (including package developers) to use Tables.rows or Tables.columns as best fits their use case. For most of the types you mentioned, they all just return the object itself when you call Tables.columns, because they already support the Columns interface (i.e., property-accessible objects of iterators). I mention this because it seems like in your case, you'd probably want to use the Matrix integration, or call Tables.columns on inputs and then use the resulting objects appropriately (both sketched below).
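A rough sketch of both suggestions; the `Tables.matrix` / `Tables.table` names are assumed from the Matrix-integration PR and should be treated as provisional:

```julia
using Tables

nt = (a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])   # a column-oriented table

# `Tables.columns` returns the input itself here (it already satisfies the
# Columns interface), so no copy is made:
cols = Tables.columns(nt)
cols.a[3]              # fast access within a column

# Matrix integration: convert a table to a Matrix, or wrap a Matrix as a table.
m = Tables.matrix(nt)  # 3×2 Matrix{Float64}
t = Tables.table(m)    # MatrixTable with columns :Column1, :Column2
```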

Anyway, happy to chat more on strategies and what can work best for you.

quinnj closed this as completed Feb 2, 2019
nalimilan (Member) commented

IIUC the use case here, what is needed is to ensure a table supports efficient indexing at arbitrary row indices, and if not convert the table to a format that fulfills this requirement. Neither Tables.rows nor Tables.columns allows doing this, and converting the table to a matrix would be wasteful in some cases where the original table already supports efficient indexing.

However, using Tables.columntable to get a named tuple of vectors could work for in-memory sources (a sketch follows the list below):

  • If the source uses a layout similar to a DataFrame, TypedTable, etc., it will simply collect the columns in a named tuple, without making a copy. Vectors in that tuple can then be accessed quite efficiently in a type-stable way.
  • If the source uses a different layout (notably row-oriented storage like CSV), it will allocate new vectors which can then be used efficiently.
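A minimal sketch of what this looks like in practice; the table below is just a stand-in, and a DataFrame or CSV source would behave analogously:

```julia
using Tables

source = (a = collect(1:10), b = rand(10))   # stand-in column-oriented table

cols = Tables.columntable(source)   # NamedTuple of vectors; no copy needed
                                    # when the source is already columnar

# Arbitrary rows, or row subsets, can then be indexed efficiently:
row5 = map(col -> col[5], cols)           # NamedTuple holding row 5
fold = map(col -> col[[2, 4, 6]], cols)   # e.g. a cross-validation fold
```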

But AFAICT it won't work for out-of-memory storage like JuliaDB, as it will copy all the data, which may not fit in memory. @ablaom Have you actually checked the performance of your code with out-of-memory JuliaDB? I would expect it to be relatively slow to access arbitrary rows, which can live on different workers. So I'm not sure a generic fallback which wouldn't make any copy is a good idea in that case.

ablaom (Contributor, Author) commented Feb 3, 2019

Discussion continued here
