Random access to rows for some in-memory sources #48

@ablaom

Feature request

I am not too familiar with Tables.jl, but I understand that if I am interacting with an in-memory data source (a DataFrame, say), I do not have random access to its rows through the interface. Rather, I can request a row iterator. If I only want a selection of rows, then (using the Tables.jl interface) I must iterate through all the rows until I have fetched the ones I need, no?

Would it be possible for Tables.jl to provide a random-access row accessor method (i.e., a getindex method) that is performant for those data sources that actually support random access (DataFrames, the dense and sparse JuliaDB formats, TypedTables, TimeSeries, etc.), with a slower fallback for out-of-memory or distributed sources?
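To make the request concrete, here is a rough sketch of the kind of dispatch I have in mind. It uses only Base Julia, and every name in it (`getrow`, `RowAccess`, `RandomAccess`, `SequentialOnly`, `rowaccess`) is hypothetical: none of these are part of Tables.jl, and the real design could look quite different.

```julia
# Hypothetical names throughout: `getrow`, `RowAccess`, `RandomAccess`,
# `SequentialOnly` and `rowaccess` are NOT part of Tables.jl.

abstract type RowAccess end
struct RandomAccess <: RowAccess end     # source can index rows directly
struct SequentialOnly <: RowAccess end   # source only offers a row iterator

# Default trait: assume nothing beyond row iteration.
rowaccess(::Type) = SequentialOnly()

# Entry point: dispatch on the trait of the source's type.
getrow(source, i::Integer) = getrow(rowaccess(typeof(source)), source, i)

# Fast path: rely on the source's own `getindex`.
getrow(::RandomAccess, source, i::Integer) = source[i]

# Slow fallback: walk the row iterator until the i-th row is reached.
function getrow(::SequentialOnly, source, i::Integer)
    for (k, row) in enumerate(source)
        k == i && return row
    end
    throw(BoundsError(source, i))
end

# A vector of named tuples is an in-memory row table, so opt it in.
rowaccess(::Type{<:AbstractVector}) = RandomAccess()

rows = [(a = 1, b = "x"), (a = 2, b = "y"), (a = 3, b = "z")]
lazy_rows = (r for r in rows)   # a generator: iterable but not indexable

getrow(rows, 2)        # takes the fast path
getrow(lazy_rows, 2)   # takes the iteration fallback
```

With something like this, a source that supports indexing would opt in with a one-line trait method, and everything else would still work through iteration, just more slowly.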

Use case

I am involved in the development of MLJ.jl, a Julia machine learning toolbox for model tuning, composing, etc. Ideally, we would like our toolbox to be somewhat data-container agnostic. The main use case would be data loaded into memory, where we want fast random row access to make tuning by cross-validation performant. We could limit ourselves to in-memory data sources and convert all such data (using IterableTables, say) into a DataFrame (say). It seems to me, however, that a superior solution would be to leave data sources alone and interact with them through Tables.jl, which would also allow us to interact with out-of-memory or distributed sources, albeit in a less than optimal way.
