Description
It would be good if CSV.File accepted a predicate argument: a function that takes the row number and the row itself and decides whether that row is kept while the file is being read.
Simple use cases:
- store only even rows
- store only random sample with 5% of ingested rows
- store only rows that have a certain value in a certain column (e.g. in ML: select only rows that should go to training data set)
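To make the intended semantics concrete, here is a minimal sketch of the three use cases above. It is written in Python with the standard `csv` module purely as an illustration; it is not CSV.jl API, and `read_filtered`, the `predicate(rownumber, row)` signature, and the sample data are all hypothetical names chosen for this example.

```python
import csv
import io
import random


def read_filtered(source, predicate):
    """Yield data rows (as dicts) for which predicate(rownumber, row) is true.

    Row numbers are 1-based and count data rows only (the header is excluded).
    Hypothetical helper: it only illustrates the proposed predicate semantics.
    """
    reader = csv.DictReader(source)
    for rownumber, row in enumerate(reader, start=1):
        if predicate(rownumber, row):
            yield row


# Made-up sample data for the illustration.
DATA = "id,split,value\n1,train,0.5\n2,test,0.7\n3,train,0.1\n4,test,0.9\n"

# Use case 1: keep only even-numbered rows.
even_rows = list(read_filtered(io.StringIO(DATA), lambda i, row: i % 2 == 0))

# Use case 2: keep a random sample of roughly 5% of the ingested rows.
rng = random.Random(42)
sample = list(read_filtered(io.StringIO(DATA), lambda i, row: rng.random() < 0.05))

# Use case 3: keep only rows with a certain value in a certain column.
train_rows = list(
    read_filtered(io.StringIO(DATA), lambda i, row: row["split"] == "train")
)
```

Passing both the row number and the row to a single predicate covers all three cases with one argument. Rejected rows are dropped as soon as they are parsed, so they never accumulate in the result.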
Why it is needed: if CSV.jl is optimized to use less memory, such filtering would allow ingesting very large files while storing only a fraction of their contents.
As an additional option, this could allow piping a CSV file in and writing it back out as filtered CSV (or JSON) without materializing the whole data set in memory. This is a relevant and frequent use case. Note that in general it cannot be done with typical UNIX line-processing utilities, because a single CSV record can span multiple lines, for example when a quoted field contains a newline.
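The streaming idea, and the reason line-based tools are not enough, can be sketched as follows. Again this is Python with the standard `csv` module, used only as an illustration and not as CSV.jl API; `filter_csv_stream` and the sample input are hypothetical. The input contains quoted fields with embedded newlines, so some records occupy two physical lines.

```python
import csv
import io


def filter_csv_stream(src, dst, predicate):
    """Copy records from src to dst, keeping those where predicate(rownumber, row) is true.

    Records are parsed and written one at a time, so the full data set is
    never held in memory. Returns the number of records kept.
    Hypothetical helper for illustration only.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header row unchanged
    kept = 0
    for rownumber, row in enumerate(reader, start=1):
        if predicate(rownumber, row):
            writer.writerow(row)
            kept += 1
    return kept


# Records 1 and 3 each span two physical lines because of the quoted newline.
src = io.StringIO(
    'id,comment\n1,"first line\nsecond line"\n2,short\n3,"also\nmultiline"\n'
)
dst = io.StringIO()

# Keep odd-numbered records; a line-based filter would split records 1 and 3.
kept = filter_csv_stream(src, dst, lambda i, row: i % 2 == 1)
```

Because filtering happens on parsed records rather than on physical lines, multi-line records are kept or dropped as a whole.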
@quinnj - do you think it would be doable?