Add docs for reading non-UTF-8 files #685

Merged: 2 commits from nl/encodings into master on Jul 10, 2020
Conversation

nalimilan (Member) commented Jul 9, 2020

It could make sense for StringEncodings to allow read(f, enc"ISO-8859-1") as a shorter variant of open(read, f, enc"ISO-8859-1") (JuliaStrings/StringEncodings.jl#37). Currently all functions that decode return strings rather than vectors of bytes, but that would probably be OK. Anyway, for CSV.jl it's better to document something that works with older releases.
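
For reference, a minimal sketch of the pattern that works with current releases, assuming a hypothetical file data.csv encoded as ISO-8859-1 (the byte-vector form relies on CSV.File accepting a Vector{UInt8}):

```julia
using CSV, StringEncodings

# Hypothetical path; assumes the file on disk is encoded as ISO-8859-1.
file = "data.csv"

# open(file, enc"ISO-8859-1") wraps the file in a decoding stream that
# yields UTF-8 data, which CSV.File parses as usual.
csv = CSV.File(open(file, enc"ISO-8859-1"))

# The longer "decode everything up front" form mentioned above:
bytes = open(read, file, enc"ISO-8859-1")  # Vector{UInt8} of UTF-8 data
csv2 = CSV.File(bytes)
```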

codecov bot commented Jul 9, 2020

Codecov Report

Merging #685 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #685   +/-   ##
=======================================
  Coverage   84.16%   84.16%           
=======================================
  Files          10       10           
  Lines        1800     1800           
=======================================
  Hits         1515     1515           
  Misses        285      285           
Impacted Files   Coverage Δ
src/file.jl      95.01% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

kragol (Contributor) commented Jul 9, 2020

You might also want to add a caveat about performance.

I recently tried to read some 4 GB non-UTF-8 .csv files using StringEncodings: it was roughly 5 to 10 times slower than reading the same file after it had been statically converted to UTF-8. The best part was that statically converting the file to UTF-8 (using iconv on Linux) and then reading it with CSV.File was still much faster overall than reading it as non-UTF-8 through StringEncodings.

Of course, one cannot always statically convert to UTF-8, but when it is possible it is clearly the better option at the moment.
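
To illustrate the static-conversion route, a sketch with hypothetical file names, assuming GNU iconv is installed:

```julia
using CSV

# Hypothetical file names; assumes GNU iconv is available on the PATH.
run(`iconv -f ISO-8859-1 -t UTF-8 data_latin1.csv -o data_utf8.csv`)

# The converted file is plain UTF-8, so CSV.File reads it directly,
# with no decoding layer in the parsing hot path.
csv = CSV.File("data_utf8.csv")
```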

@quinnj quinnj merged commit 4136307 into master Jul 10, 2020
@quinnj quinnj deleted the nl/encodings branch July 10, 2020 14:16
quinnj (Member) commented Jul 10, 2020

This is great; thanks @nalimilan. @kragol, I think the difference in timings you're seeing is just the actual conversion cost; when you do CSV.File(open(file, enc"whatever")) it has to first decode the entire file from that encoding, then do the CSV parsing. So for a fairer comparison, you'd want to time the iconv re-encoding together with the subsequent CSV.File read.
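
A sketch of the fairer comparison described here, again with hypothetical file names:

```julia
using CSV, StringEncodings

# Decode-while-parsing: the conversion cost is paid inside this one call.
@time CSV.File(open("data_latin1.csv", enc"ISO-8859-1"))

# Convert-then-parse: to be fair, include the re-encoding step in the timing.
@time begin
    run(`iconv -f ISO-8859-1 -t UTF-8 data_latin1.csv -o data_utf8.csv`)
    CSV.File("data_utf8.csv")
end
```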

nalimilan (Member, Author) commented
Actually, I hadn't done any performance optimization in StringEncodings, and it was really slow. Profiling showed that most of the time was spent in calls to readbytes! on the underlying data, due to the lack of a specialized method. With JuliaStrings/StringEncodings.jl#38 it's about 10 times faster, and in my test it's "only" about twice as slow as parsing a UTF-8 CSV file.
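
For the curious, a sketch of how such a profile could be reproduced (hypothetical file name):

```julia
using Profile, CSV, StringEncodings

# Run once first so compilation time doesn't dominate the profile.
CSV.File(open("data_latin1.csv", enc"ISO-8859-1"))

Profile.clear()
@profile CSV.File(open("data_latin1.csv", enc"ISO-8859-1"))
Profile.print()  # before StringEncodings.jl#38, readbytes! dominated here
```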

quinnj (Member) commented Jul 10, 2020

Ah, that's great to hear!
