Feature request
I am not very familiar with Tables.jl, but as I understand it, even when I am working with an in-memory data source (a DataFrame, say), the interface does not give me random access to the rows of the table. Instead, I can request a row iterator. If I want only a selection of rows, then (using the Tables.jl interface alone) I must iterate through the rows until I have fetched the ones I need, no?
Would it be possible for Tables.jl to provide a random-access row accessor (i.e., a getindex method) that is performant for those data sources that actually support random access (DataFrames, the dense and sparse JuliaDB formats, TypedTables, TimeSeries, etc.), with a slower fallback for out-of-memory or distributed sources?
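To make the request concrete, here is a rough sketch of the kind of behaviour I have in mind, written against the Tables.jl accessors as I understand them (`Tables.columnaccess`, `Tables.columns`, `Tables.columnnames`, `Tables.getcolumn`, `Tables.namedtupleiterator`, `Tables.columntable`). The name `selectrows_sketch` is hypothetical and not part of Tables.jl. The sketch also assumes that `idxs` is a sorted vector of unique positive integers and that the columns of a column-access source support integer-vector indexing.

```julia
using Tables

# Hypothetical helper (not part of Tables.jl): return the rows at positions
# `idxs` as a NamedTuple of column vectors.
# Assumption: `idxs` is sorted and contains no duplicates, so that both
# branches return rows in the same order.
function selectrows_sketch(table, idxs::AbstractVector{<:Integer})
    Tables.istable(table) || throw(ArgumentError("not a Tables.jl table"))
    if Tables.columnaccess(table)
        # Fast path: the source exposes columns, so index each one directly.
        cols = Tables.columns(table)
        names = Tuple(Tables.columnnames(cols))
        vals = Tuple(Tables.getcolumn(cols, n)[idxs] for n in names)
        return NamedTuple{names}(vals)
    else
        # Slow fallback: a single pass over the row iterator, keeping only
        # the rows whose position was requested.
        wanted = Set(idxs)
        rows = Tables.namedtupleiterator(table)
        kept = [r for (i, r) in enumerate(rows) if i in wanted]
        return Tables.columntable(kept)
    end
end

# A column-access source (NamedTuple of vectors) and a row-access source
# (vector of NamedTuples) holding the same data.
coltable = (x = [10, 20, 30, 40], y = ["a", "b", "c", "d"])
rowtable = Tables.rowtable(coltable)

fast = selectrows_sketch(coltable, [2, 4])
slow = selectrows_sketch(rowtable, [2, 4])
```

With the fast path the cost is proportional to the number of rows requested, whereas the fallback always walks the whole iterator. That difference is the one that matters for repeated resampling.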
Use case
I am involved in the development of MLJ.jl, a Julia machine learning toolbox for model tuning, composition, etc. Ideally, we would like the toolbox to be largely agnostic about the data container. The main use case is data loaded into memory, where we want fast random access to rows so that tuning by cross-validation is performant. We could limit ourselves to in-memory data sources and convert all such data (using IterableTables, say) into a DataFrame (say). It seems to me, however, that a superior solution would be to leave data sources alone and interact with them through Tables.jl, which would also allow us to work with out-of-memory or distributed sources, albeit in a less than optimal way.