Add docs for reading non-UTF-8 files #685

Merged: 2 commits from nl/encodings into master on Jul 10, 2020
Conversation

nalimilan (Member) commented Jul 9, 2020

It could make sense for StringEncodings to allow read(f, enc"ISO-8859-1") as a shorter variant of open(read, f, enc"ISO-8859-1") (JuliaStrings/StringEncodings.jl#37). Currently all functions that decode return strings rather than vectors of bytes, but that would probably be OK. Anyway, for CSV.jl it's better to document something that works with older releases.
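
For reference, a minimal sketch of the pattern that works with current releases, assuming a hypothetical file data.csv encoded as ISO-8859-1 (the byte-vector form relies on CSV.File accepting a Vector{UInt8}):

```julia
using CSV, StringEncodings

# Hypothetical path; assumes the file on disk is encoded as ISO-8859-1.
file = "data.csv"

# open(file, enc"ISO-8859-1") wraps the file in a decoding stream that
# yields UTF-8 data, which CSV.File parses as usual.
csv = CSV.File(open(file, enc"ISO-8859-1"))

# The longer "decode everything up front" form mentioned above:
bytes = open(read, file, enc"ISO-8859-1")  # Vector{UInt8} of UTF-8 data
csv2 = CSV.File(bytes)
```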

codecov bot commented Jul 9, 2020

Codecov Report

Merging #685 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #685   +/-   ##
=======================================
  Coverage   84.16%   84.16%           
=======================================
  Files          10       10           
  Lines        1800     1800           
=======================================
  Hits         1515     1515           
  Misses        285      285           
Impacted Files   Coverage Δ
src/file.jl      95.01% <ø> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

kragol (Contributor) commented Jul 9, 2020

You might also want to add a caveat about performance.

I recently tried to read some 4 GB non-UTF-8 .csv files using StringEncodings: it was roughly 5 to 10 times slower than reading the same file after it had been statically converted to UTF-8. The best part was that statically converting the file to UTF-8 (using iconv on Linux) and then reading it with CSV.File was still much faster overall than reading it as non-UTF-8 through StringEncodings.

Of course, one cannot always statically convert to UTF-8, but when it is possible it is clearly the better option at the moment.
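
To illustrate the static-conversion route, a sketch with hypothetical file names, assuming GNU iconv is installed:

```julia
using CSV

# Hypothetical file names; assumes GNU iconv is available on the PATH.
run(`iconv -f ISO-8859-1 -t UTF-8 data_latin1.csv -o data_utf8.csv`)

# The converted file is plain UTF-8, so CSV.File reads it directly,
# with no decoding layer in the parsing hot path.
csv = CSV.File("data_utf8.csv")
```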

@quinnj quinnj merged commit 4136307 into master Jul 10, 2020
@quinnj quinnj deleted the nl/encodings branch July 10, 2020 14:16
quinnj (Member) commented Jul 10, 2020

This is great; thanks @nalimilan. @kragol, I think the difference in timings you're seeing is just the actual conversion cost; when you do CSV.File(open(file, enc"whatever")) it has to first decode the entire file from that encoding, then do the CSV parsing. So for a fairer comparison, you'd want to time the iconv re-encoding together with the subsequent CSV.File read.
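
A sketch of the fairer comparison described here, again with hypothetical file names:

```julia
using CSV, StringEncodings

# Decode-while-parsing: the conversion cost is paid inside this one call.
@time CSV.File(open("data_latin1.csv", enc"ISO-8859-1"))

# Convert-then-parse: to be fair, include the re-encoding step in the timing.
@time begin
    run(`iconv -f ISO-8859-1 -t UTF-8 data_latin1.csv -o data_utf8.csv`)
    CSV.File("data_utf8.csv")
end
```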

nalimilan (Member, Author) commented
Actually, I hadn't done any performance optimization in StringEncodings, and it was really slow. Profiling showed that most of the time was spent in calls to readbytes! on the underlying data, due to the lack of a specialized method. With JuliaStrings/StringEncodings.jl#38 it's about 10 times faster, and in my test it's "only" about twice as slow as parsing a UTF-8 CSV file.
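
For the curious, a sketch of how such a profile could be reproduced (hypothetical file name):

```julia
using Profile, CSV, StringEncodings

# Run once first so compilation time doesn't dominate the profile.
CSV.File(open("data_latin1.csv", enc"ISO-8859-1"))

Profile.clear()
@profile CSV.File(open("data_latin1.csv", enc"ISO-8859-1"))
Profile.print()  # before StringEncodings.jl#38, readbytes! dominated here
```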

quinnj (Member) commented Jul 10, 2020

Ah, that's great to hear!
