burntsushi's issue #1092

jariji · 2023-05-31T18:10:15Z

Andrew Gallant (aka burntsushi), author of ripgrep and xsv, wrote in 2020 that some CSV files won't parse correctly with CSV.jl's then-current strategy.

https://news.ycombinator.com/item?id=24747509

Just thought I'd bring it up in case there's something worth documenting here.

Drvi · 2023-05-31T20:30:48Z

That argument is exactly why in ChunkedCSV.jl we don't "jump and recover", even though I still think that is the most performant strategy. Ideally, the user would get to choose which strategy to use for their file (e.g. if the file contains no string fields, then what CSV.jl does is pretty much optimal and safe). Still, I think CSV.jl seems to be safe enough in practice, and with some work it could be made entirely safe: it would just need to detect that it has reached an inconsistent state and use this information to retry with better chunking boundaries.
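
To make the concern concrete, here is a minimal sketch in Python (not the actual CSV.jl or ChunkedCSV.jl implementation) of why jumping to an arbitrary byte offset and resuming at the next newline can go wrong when a quoted string field contains a newline. The function names `naive_row_start` and `quote_aware_row_starts` and the sample data are hypothetical, chosen only for illustration.

```python
def naive_row_start(data: bytes, offset: int) -> int:
    """Jump to `offset` and assume the next newline ends a row.

    Illustrative only: this ignores whether the newline sits inside a
    quoted field, which is exactly the unsafe assumption.
    """
    nl = data.find(b"\n", offset)
    return len(data) if nl == -1 else nl + 1


def quote_aware_row_starts(data: bytes) -> list:
    """Scan sequentially from the start, tracking quote state.

    A doubled quote ("") toggles the state twice, so escaped quotes
    leave the state unchanged.
    """
    starts = [0]
    in_quotes = False
    for i, b in enumerate(data):
        if b == ord('"'):
            in_quotes = not in_quotes
        elif b == ord("\n") and not in_quotes and i + 1 < len(data):
            starts.append(i + 1)
    return starts


# Hypothetical sample: the second row has a newline inside a quoted field.
data = b'id,comment\n1,"line one\nline two"\n2,plain\n'

true_starts = quote_aware_row_starts(data)  # [0, 11, 33]
guess = naive_row_start(data, 15)           # 23, inside the quoted field

# A chunk boundary is only safe if it coincides with a real row start.
boundary_is_safe = guess in true_starts     # False
```

If every field is unquoted, the two approaches agree, which matches the point above that the jump strategy is safe for files without string fields. A retry scheme would, under these assumptions, need some way to notice that a boundary like `guess` is not a real row start and pick a different one.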

nickrobinson251 added the performance, documentation, and design labels on May 31, 2023
