
Random access to rows for some in-memory sources #48

Closed · ablaom opened this issue Nov 30, 2018 · 5 comments

Comments

ablaom (Contributor) commented Nov 30, 2018

Feature request

I am not too familiar with Tables.jl, but I understand that if I am interacting with an in-memory data source (a DataFrame, say), then I do not have random access to the rows of my frame through the interface. Rather, I can request a row iterator. If I only want a selection of rows, then (using the Tables.jl interface) I must iterate through all the rows until I have fetched the ones I need, no?

Would it be possible for Tables.jl to provide a random-access row accessor (i.e., a getindex method) that is performant for those data sources that actually support random access (DataFrames, the dense and sparse JuliaDB formats, TypedTables, TimeSeries, etc.), with a slower fallback for out-of-memory or distributed sources?
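For illustration, a minimal sketch of the kind of interface meant here; the names `randomaccess` and `getrow` are hypothetical and not part of Tables.jl:

```julia
using Tables

# Hypothetical sketch only: neither `randomaccess` nor `getrow` exists in Tables.jl.

# Trait: does this source support fast random access to rows?
randomaccess(::Any) = false

# Slow generic fallback: iterate rows until the requested index is reached.
function getrow(table, i::Integer)
    for (k, row) in enumerate(Tables.rows(table))
        k == i && return row
    end
    throw(BoundsError(table, i))
end

# A source with indexable rows would override both, e.g. (hypothetically):
# randomaccess(::DataFrame) = true
# getrow(df::DataFrame, i::Integer) = df[i, :]
```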

Use case

I am involved in the development of MLJ.jl, a Julia machine learning toolbox for model tuning, composition, etc. Ideally we would like our toolbox to be somewhat data-container agnostic. The main use case would be data loaded into memory, in which case we want fast random row access to make tuning by cross-validation performant. We could limit ourselves to in-memory data sources and convert all such data (using IterableTables, say) into a DataFrame (say). It seems to me, however, that a superior solution would be to leave data sources alone and interact with them through Tables.jl, which would also allow us to interact with out-of-memory or distributed sources, albeit in a less than optimal way.

tpapp (Contributor) commented Nov 30, 2018

I am not sure an interop standard is the best place to do this. If both the source and destination support the Tables.jl interface, you can always read the data into a representation that supports random access (via some other API).
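For instance, a minimal sketch of that approach, assuming DataFrames.jl as the destination (any sink with indexable rows would do):

```julia
using DataFrames, Tables

# Any Tables.jl source can be materialized into a row-indexable representation.
src = (a = 1:3, b = ["x", "y", "z"])   # a NamedTuple of vectors is itself a table
df = DataFrame(src)                    # copies the data into a DataFrame
df[2, :]                               # random access to the second row
```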

davidanthoff (Collaborator) commented

TableTraits.jl provides that functionality as an optional interface. There is no source right now that implements it, but it would really be just a handful of lines of code to add this to a source. For some of the plans we have with Query.jl we also need this functionality.

@quinnj and I will try sometime later this year to see if we can somehow sort out the relationship between Tables.jl and TableTraits.jl, and if we pull that off I would assume that this would be part of that.

quinnj (Member) commented Feb 2, 2019

@ablaom, sorry for the slow response here; I'm trying to catch up on things after taking a bit of a break. I know we've talked a bit elsewhere (discourse, the MLJ repos), so I'm not sure how things have evolved since you originally opened this issue, but here are a few thoughts.

  • I'm adding first-class AbstractMatrix integration in this PR: Add MatrixTable type to wrap a Matrix and provide Tables.jl interface. #61. That allows seamless to/from conversion for any Tables.jl implementor.
  • We also have the materializer function to get back the same type that came in.
  • In general, the strategy of Tables.jl is for "users" (including package developers) to use Tables.rows or Tables.columns as best fits their use case. For most of the types you mentioned, they all just return the object itself when you call Tables.columns, because they already support the Columns interface (i.e., property-accessible objects of iterators). I mention this because it seems like in your case, you'd probably want to use the Matrix integration, or call Tables.columns on inputs and then use the resulting objects appropriately (both sketched below).
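A rough sketch of both suggestions; the `Tables.matrix` / `Tables.table` names are assumed from the Matrix-integration PR and should be treated as provisional:

```julia
using Tables

nt = (a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])   # a column-oriented table

# `Tables.columns` returns the input itself here (it already satisfies the
# Columns interface), so no copy is made:
cols = Tables.columns(nt)
cols.a[3]              # fast access within a column

# Matrix integration: convert a table to a Matrix, or wrap a Matrix as a table.
m = Tables.matrix(nt)  # 3×2 Matrix{Float64}
t = Tables.table(m)    # MatrixTable with columns :Column1, :Column2
```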

Anyway, happy to chat more on strategies and what can work best for you.

quinnj closed this as completed Feb 2, 2019
nalimilan (Member) commented

IIUC the use case here, what is needed is to ensure a table supports efficient indexing at arbitrary row indices, and if not convert the table to a format that fulfills this requirement. Neither Tables.rows nor Tables.columns allows doing this, and converting the table to a matrix would be wasteful in some cases where the original table already supports efficient indexing.

However, using Tables.columntable to get a named tuple of vectors could work for in-memory sources (a sketch follows the list below):

  • If the source uses a layout similar to a DataFrame, TypedTable, etc., it will simply collect the columns in a named tuple, without making a copy. Vectors in that tuple can then be accessed quite efficiently in a type-stable way.
  • If the source uses a different layout (notably row-oriented storage like CSV), it will allocate new vectors which can then be used efficiently.
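A minimal sketch of what this looks like in practice; the table below is just a stand-in, and a DataFrame or CSV source would behave analogously:

```julia
using Tables

source = (a = collect(1:10), b = rand(10))   # stand-in column-oriented table

cols = Tables.columntable(source)   # NamedTuple of vectors; no copy needed
                                    # when the source is already columnar

# Arbitrary rows, or row subsets, can then be indexed efficiently:
row5 = map(col -> col[5], cols)           # NamedTuple holding row 5
fold = map(col -> col[[2, 4, 6]], cols)   # e.g. a cross-validation fold
```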

But AFAICT it won't work for out-of-memory storage like JuliaDB, as it will copy all the data, which may not fit in memory. @ablaom Have you actually checked the performance of your code with out-of-memory JuliaDB? I would expect it to be relatively slow to access arbitrary rows, which can live on different workers. So I'm not sure a generic fallback which wouldn't make any copy is a good idea in that case.

ablaom (Contributor, Author) commented Feb 3, 2019

Discussion continued here
