Random access to rows for some in-memory sources #48
Comments
I am not sure an interop standard is the best place to do this. If both the source and destination support the Tables.jl interface, you can always read the data into a representation that supports random access (via some other API).
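To make the suggestion above concrete, here is a minimal sketch in plain Julia. It assumes the source only offers row iteration; `rows_only` is a hypothetical stand-in for such a source, not a real Tables.jl object. With Tables.jl itself, `Tables.rowtable` or `Tables.columntable` would be the usual way to materialize a source, but this sketch avoids the dependency.

```julia
# Hypothetical stand-in for a source that only offers row iteration.
rows_only = ((id = i, x = 10.0 * i) for i in 1:5)

# Materialize once: a Vector of NamedTuples supports indexing by row number.
materialized = collect(rows_only)

# Fetch an arbitrary selection of rows without iterating the source again.
selected = materialized[[4, 2]]
```

The cost is a full copy of the data, which is only reasonable when the table fits in memory.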
TableTraits.jl provides that functionality as an optional interface. There is no source right now that implements it, but it would really be just a handful of lines of code to add this to a source. For some of the plans we have with Query.jl we also need this functionality. @quinnj and I will try sometime later this year to see if we can somehow sort out the relationship between Tables.jl and TableTraits.jl, and if we pull that off I would assume that this would be part of that.
@ablaom, sorry for the slow response here; I'm trying to catch up on things after taking a bit of a break. I know we've talked a bit elsewhere (discourse, the MLJ repos), so I'm not sure how things have evolved since you originally opened this issue, but here are a few thoughts.
Anyway, happy to chat more on strategies and what can work best for you.
IIUC the use case here, what is needed is to ensure a table supports efficient indexing at arbitrary row indices, and if not convert the table to a format that fulfills this requirement. Neither However, using
But AFAICT it won't work for out-of-memory storage like JuliaDB, as it will copy all the data, which may not fit in memory. @ablaom, have you actually checked the performance of your code with out-of-memory JuliaDB? I would expect it to be relatively slow to access arbitrary rows, which can be on different workers. So I'm not sure a generic fallback that wouldn't make any copy is a good idea in that case.
Discussion continued here
Feature request
I am not too familiar with Tables.jl, but I understand that if I am interacting with an in-memory data source (a DataFrame, say), then I do not have random access to the rows of my frame through the interface. Rather, I can request a row iterator. If I only want a selection of rows, then (using the Tables.jl interface) I must iterate through all the rows until I have fetched the ones I need, no?
Would it be possible for Tables.jl to provide a random-access row accessor (i.e., a getindex method) that is performant for those data sources that actually support random access (DataFrames, the dense and sparse JuliaDB formats, TypedTables, TimeSeries, etc.), with a slower fallback for out-of-memory or distributed sources?
Use case
I am involved in the development of MLJ.jl, a Julia machine learning toolbox for model tuning, composing, etc. Ideally we would like our toolbox to be somewhat data-container agnostic. The main use case would be data loaded into memory, in which case we want fast random row access to make tuning by cross-validation performant. We could limit ourselves to in-memory data sources and convert all such data into a DataFrame, say (using IterableTables, for instance). It seems to me, however, that a superior solution would be to leave data sources alone and interact with them through Tables.jl, which would also allow us to interact with out-of-memory or distributed sources, albeit in a less than optimal way?
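As a rough illustration of the kind of accessor requested here, the sketch below uses multiple dispatch to give indexable sources a fast path and everything else a one-pass fallback. `selectrows` is a hypothetical name and is not part of Tables.jl; the two "tables" are plain Julia stand-ins (a vector of NamedTuples and a generator), chosen only so the example runs without any packages.

```julia
# Hypothetical helper, NOT part of Tables.jl: select rows by integer index.

# Fast path: sources that already support indexing (here, any AbstractVector of rows).
selectrows(table::AbstractVector, idxs) = table[idxs]

# Slow fallback: any other iterable of rows. One pass, keeping only requested rows.
function selectrows(table, idxs)
    wanted = Set(idxs)
    found = Dict{Int,Any}()
    for (i, row) in enumerate(table)
        i in wanted && (found[i] = row)
        length(found) == length(wanted) && break  # stop once every requested row is seen
    end
    return [found[i] for i in idxs]  # preserve the requested order
end

# Stand-in data: one indexable table, one that can only be iterated.
in_memory = [(id = i, y = i^2) for i in 1:6]
streamed  = ((id = i, y = i^2) for i in 1:6)

# Row indices of a cross-validation test fold (arbitrary order).
test_fold = [5, 2]
fast = selectrows(in_memory, test_fold)
slow = selectrows(streamed, test_fold)
```

Both calls return the same rows. The fallback still has to iterate up to the largest requested index, so for distributed or out-of-memory sources it would be correct but potentially slow.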