CSV-row filtering when reading #503

@bkamins

It would be good if CSV.File accepted a predicate argument that takes the row number and the row itself, so that rows could be filtered as they are read, conditional on this predicate.

Simple use cases:

  • store only even-numbered rows
  • store only a random sample of 5% of the ingested rows
  • store only rows that have a certain value in a certain column (e.g. in ML: select only the rows that should go into the training data set)
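To make the requested semantics concrete, here is a minimal sketch of the three use cases above. It is written in Python with only the standard library and is not CSV.jl's API: `read_filtered` and `keep` are hypothetical names, and numbering data rows from 1 (header excluded) is an assumption of the sketch.

```python
import csv
import io
import random


def read_filtered(source, keep):
    """Yield only the rows for which keep(row_number, row) is true.

    Hypothetical helper for illustration. Data rows are numbered from 1
    and the header line is not counted.
    """
    reader = csv.DictReader(source)
    for row_number, row in enumerate(reader, start=1):
        if keep(row_number, row):
            yield row


DATA = "id,split,x\n1,train,0.5\n2,test,0.7\n3,train,0.1\n4,test,0.9\n"

# Use case 1: keep only even-numbered rows.
even_rows = list(read_filtered(io.StringIO(DATA), lambda i, row: i % 2 == 0))

# Use case 2: keep each row with probability 0.05 (a random ~5% sample).
rng = random.Random(0)
sample_rows = list(read_filtered(io.StringIO(DATA), lambda i, row: rng.random() < 0.05))

# Use case 3: keep only rows with a given value in a given column.
train_rows = list(read_filtered(io.StringIO(DATA), lambda i, row: row["split"] == "train"))
```

Because the predicate runs before a row is stored, rejected rows never have to be kept, which is the property the request depends on.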

Why it is needed: if CSV.jl is optimized to be less memory hungry, then such filtering would allow ingesting very large files while keeping only a fraction of their contents in memory.

As an additional option, one can imagine this allowing a CSV file to be piped in and written back out as CSV (or JSON) in filtered form, without materializing the whole data set in memory. This is a relevant and frequent use case; note that in general it cannot be done with typical UNIX line-processing utilities, because a single record of a CSV file can span multiple lines.
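A rough sketch of that streaming idea, again in Python's standard library rather than CSV.jl: `filter_csv_stream` is a hypothetical name, and the sample data is made up to include quoted fields containing newlines, which is the case that line-oriented tools get wrong.

```python
import csv
import io


def filter_csv_stream(src, dst, keep):
    """Copy the header and every record accepted by keep(row_number, row).

    Hypothetical helper for illustration. Records are handled one at a
    time, so the whole data set is never held in memory. Returns the
    number of data records written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return 0  # empty input: nothing to copy
    writer.writerow(header)
    written = 0
    for row_number, row in enumerate(reader, start=1):
        if keep(row_number, row):
            writer.writerow(row)
            written += 1
    return written


# Records 1 and 3 each span two physical lines because of quoted newlines.
SRC = 'id,note\n1,"first line\nsecond line"\n2,plain\n3,"also\nmultiline"\n'

out = io.StringIO()
count = filter_csv_stream(io.StringIO(SRC), out, lambda i, row: i != 2)
```

Here the filter drops the second record while both multi-line records survive intact; a tool that filters by physical line would instead see six lines and could split a record in half.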

@quinnj - do you think it would be doable?
