Error when reading csv in non UTF-8 encoding #1022

Closed
etibarg opened this issue Sep 2, 2022 · 8 comments · Fixed by JuliaStrings/StringEncodings.jl#53

etibarg commented Sep 2, 2022

Hello,

According to https://docs.juliahub.com/CSV/HHBkp/0.10.4/examples.html#stringencodings, a CSV file in a non-UTF-8 encoding can be read using StringEncodings.
The example code provided is:

file = CSV.File(open("iso8859_encoded_file.csv", enc"ISO-8859-1"))

However, I can't make it work as suggested:

julia> file = CSV.File(open("EBNM_cropped_im\\output_detection_EBNM.csv", enc"UCS-2LE"))
ERROR: MethodError: no method matching readavailable(::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream})
Closest candidates are:
  readavailable(::Base.AbstractPipe) at io.jl:427
  readavailable(::Base.GenericIOBuffer) at iobuffer.jl:467
  readavailable(::Base.LibuvStream) at stream.jl:983
  ...
Stacktrace:
 [1] write(to::TranscodingStreams.NoopStream{IOStream}, from::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream})
   @ Base .\io.jl:753
 [2] buffer_to_tempfile
   @ C:\Users\XXX\.julia\packages\CSV\jFiCn\src\utils.jl:311 [inlined]
 [3] getbytebuffer(x::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream}, buffer_in_memory::Bool)
   @ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\utils.jl:268
 [4] getsource(x::Any, buffer_in_memory::Bool)
   @ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\utils.jl:288
 [5] CSV.Context(source::CSV.Arg, header::CSV.Arg, normalizenames::CSV.Arg, datarow::CSV.Arg, skipto::CSV.Arg, footerskip::CSV.Arg, transpose::CSV.Arg, comment::CSV.Arg, ignoreemptyrows::CSV.Arg, ignoreemptylines::CSV.Arg, select::CSV.Arg, drop::CSV.Arg, limit::CSV.Arg, buffer_in_memory::CSV.Arg, threaded::CSV.Arg, ntasks::CSV.Arg, tasks::CSV.Arg, rows_to_check::CSV.Arg, lines_to_check::CSV.Arg, missingstrings::CSV.Arg, missingstring::CSV.Arg, delim::CSV.Arg, ignorerepeated::CSV.Arg, quoted::CSV.Arg, quotechar::CSV.Arg, openquotechar::CSV.Arg, closequotechar::CSV.Arg, escapechar::CSV.Arg, dateformat::CSV.Arg, dateformats::CSV.Arg, decimal::CSV.Arg, truestrings::CSV.Arg, falsestrings::CSV.Arg, stripwhitespace::CSV.Arg, type::CSV.Arg, types::CSV.Arg, typemap::CSV.Arg, pool::CSV.Arg, downcast::CSV.Arg, lazystrings::CSV.Arg, stringtype::CSV.Arg, strict::CSV.Arg, silencewarnings::CSV.Arg, maxwarnings::CSV.Arg, debug::CSV.Arg, parsingdebug::CSV.Arg, validate::CSV.Arg, streaming::CSV.Arg)
   @ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\context.jl:304
 [6] #File#25
   @ C:\Users\XXX\.julia\packages\CSV\jFiCn\src\file.jl:221 [inlined]
 [7] CSV.File(source::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream})
   @ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\file.jl:162
 [8] top-level scope
   @ REPL[14]:1

Notes:

  • According to Notepad++, the encoding is "UTF-16 LE BOM" / "UCS-2 LE".
  • Still in Notepad++, when I change the encoding to "UTF-8", I can read the file.

Versioninfo:

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab7 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 8 on 16 virtual cores
Environment:
  JULIA_PKG_DEVDIR = C:/Users/XXX/Devdir
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

quinnj commented Sep 2, 2022

Bummer; it seems like the StringDecoder object doesn't implement the expected IO interface we're relying on internally.

quinnj commented Sep 2, 2022

A workaround would be to do file = CSV.File(read(open("iso8859_encoded_file.csv", enc"ISO-8859-1")))

etibarg commented Sep 2, 2022

It works, thanks!

cirocavani commented

I had the same problem.

Maybe update the examples?

The CSV.File docstring already suggests a similar workaround:

https://csv.juliadata.org/stable/reading.html#CSV.File

For text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, input, enc"ISO-8859-1")).
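
The eager-read workaround quoted above can be sketched as follows. This is only an illustration under stated assumptions: the file path and its contents are made up for the example, and the encoding is taken to be UTF-16LE to match the report. The file is decoded to UTF-8 bytes in full before CSV.File sees it, so the StringDecoder stream is never handed to CSV.jl directly.

```julia
using CSV, StringEncodings

# Hypothetical path and contents, created here only so the sketch is self-contained.
path = joinpath(mktempdir(), "utf16_example.csv")
open(path, "w") do io
    # encode returns the UTF-16LE bytes of the given string
    write(io, encode("a,b\n1,2\n3,4\n", enc"UTF-16LE"))
end

# Decode the whole file to UTF-8 bytes first, then pass the bytes to CSV.File.
bytes = open(read, path, enc"UTF-16LE")
file = CSV.File(bytes)
```

Because the whole file is read into memory, this trades memory for simplicity, which may matter for very large inputs.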

nalimilan commented

I've implemented a fix at JuliaStrings/StringEncodings.jl#53.

Though @quinnj, do you think CSV.jl could use a more efficient approach? My implementation of readavailable returns copies of the data (in batches of 200 bytes), as the function is documented to return a Vector{UInt8} (otherwise I would return a SubArray). But that copy is only used by the write(::IO, ::IO) fallback to write its contents to another stream, so it's really wasteful. By contrast, StringDecoder has an efficient implementation of readbytes!, which isn't used here. Do you know of a better API that buffer_to_tempfile could use? Or maybe that's a problem in Julia and the write(::IO, ::IO) fallback should be improved, or maybe readavailable should even be changed to support returning views?
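
To make the contrast concrete, here is a minimal sketch of a stream-to-stream copy built on readbytes! with one reusable buffer, so no fresh Vector{UInt8} is allocated per batch. The helper name copy_with_readbytes! and the chunk keyword are hypothetical; this is not how CSV.jl's buffer_to_tempfile or Base's write(::IO, ::IO) fallback is implemented, and plain IOBuffers stand in for the real streams.

```julia
# Hypothetical helper, not part of CSV.jl, StringEncodings.jl, or Base.
function copy_with_readbytes!(to::IO, from::IO; chunk::Integer = 8192)
    buf = Vector{UInt8}(undef, chunk)        # one buffer, reused for every chunk
    total = 0
    while !eof(from)
        n = readbytes!(from, buf, chunk)     # fills buf in place, returns the byte count
        total += write(to, view(buf, 1:n))   # write only the bytes actually read
    end
    return total
end

# Stand-in streams for illustration; any IO that supports readbytes! should do.
src = IOBuffer("a,b\n1,2\n3,4\n")
dst = IOBuffer()
ncopied = copy_with_readbytes!(dst, src; chunk = 4)
```

The point of the design is that the only allocation is the buffer itself, whereas a readavailable-based loop has to return a new array for each batch.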

nalimilan commented

I'm releasing a new StringEncodings version to fix this, but it would still be interesting to check whether a more efficient solution could be implemented @quinnj.

nalimilan commented

Bump @quinnj.

quinnj commented Jun 30, 2023

Ah, sorry for the slow response. Yeah, @Drvi and I have been jamming on some bigger-picture refactorings that will eventually make their way here, and yes, in the new model, we're using readbytes! instead.
