Support UTF-16 encoded files in fread #2560

Open
danielsjf opened this issue Jan 9, 2018 · 14 comments
Labels: encoding, feature request, fread

Comments

@danielsjf

I am trying to open a file that contains '00' (NUL) bytes; more specifically, every other byte is '00'. There are also two bytes at the beginning that seem to specify only the encoding. I haven't been able to read this file with fread. See also the full example (+reprex) in this Stack Overflow post:
https://stackoverflow.com/questions/48169100/reading-a-tsv-with-specific-encoding-initial-two-bytes-and-utf-8-afterwards-an

When I googled the issue I found similar problems:
https://q-a-assistant.info/computer-internet-technology/r-data-table-error-in-fread-embedded-nul-in-string-0-0-0-000/264557
https://stackoverflow.com/questions/31701365/error-with-fread-in-r-embedded-nul-in-string-0
https://stackoverflow.com/questions/22643372/embedded-nul-in-string-error-when-importing-csv-with-fread?lq=1

But the workarounds are not sufficient:

  • The first one is just opening the file in Excel and saving it again. Since the data I use is downloaded automatically from another source (without user interaction), I don't have this option. Even worse, even if I wanted to do it that way, the file is longer than Excel's 1M-row limit, so I couldn't use this for the entire file.
  • The subsequent ones use a Linux command, and when I try to use it inside the fread command, it doesn't work (the person from the first post has the same issue).

Would it be possible to skip the NUL values, just as the base functions do? See readLines (skipNul) or read.table (skipNul).

This is how the file shows up in a hex editor:
image

First 100 bytes of the file: test_file.txt
It's actually a TSV file, but GitHub doesn't allow that format.

# Reprex

file <- 'test_file.txt'

# fread from data.table is not able to read the file
tmp <- data.table::fread(file, nrows = 2)
#> Error in data.table::fread(file, nrows = 2): embedded nul in string: 'ÿþy\0e\0a\0r\0'

# It also doesn't work with sed, potentially since I'm on Windows
tmp <- data.table::fread(paste0("sed 's/\\0//g' '", file, "'"), nrows = 2)
#> Error in data.table::fread(paste0("sed 's/\\0//g' '", file, "'"), nrows = 2): embedded nul in string: 'ÿþy\0e\0a\0r\0'

# Output of sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5 tools_3.4.1 yaml_2.1.15 stringi_1.1.6 data.table_1.10.4-3
[7] stringr_1.2.0

@MichaelChirico
Member

Please install the current development version as this particular issue has seen some progress lately.

I get a different error on the file you provided:

fread('~/Downloads/test_file.txt')
# Error in fread("~/Downloads/test_file.txt") : 
#   File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.

@danielsjf
Author

With the latest development version, I get an error that seems unrelated:

tmp <- data.table::fread(file, nrows = 2)
#> Error in data.table::fread(file, nrows = 2) : object 'CfreadR' not found

So I would trust your version more. It seems logical that it is indeed some kind of a UTF-16 version. Will there be an option in the future to read UTF-16 files directly?

@MichaelChirico
Member

You're having an update issue -- try uninstalling data.table completely first...

From the Installation wiki:

On Windows, when upgrading any package that uses compiled code, it appears to be important to close all R sessions before upgrading. This releases all locks that Windows holds on dlls. To be really sure, reboot too. Then open a new fresh R session.

@MichaelChirico
Member

MichaelChirico commented Jan 9, 2018

The problem with the UTF-16 theory is that Atom opens the file as UTF-8 and points out the leading NUL characters:

screen shot 2018-01-09 at 11 26 22 am

So I guess this is in fact an outstanding bug.

@MichaelChirico
Member

Tagging related issues: #2247, #2496, #2435

@danielsjf
Author

Re-installed data.table and now I also see the UTF-16 error.

@danielsjf
Author

Both Sublime Text 3 and Notepad++ also read it correctly. Sublime seems to ignore the first two bytes and subsequently skips the NUL bytes. Notepad++ does the same, but marks the encoding as "UCS-2 LE BOM". After some Googling I found that UCS-2 was a predecessor of UTF-16. The two standards are almost identical, which often causes confusion.

@MichaelChirico
Member

Following the approach here it seems Chrome reads the page as "UTF-16LE".

@st-pasha changed the title from "embedded nul in string" to "Support UTF-16 encoded files in fread" on Jan 9, 2018
@HughParsonage
Member

For completeness, running file from Git Bash reports "Little-endian UTF-16 Unicode text, with CRLF line terminators", which is consistent with the fread error message. But I guess the point is moot: without a header, you can only infer the encoding.

@MichaelChirico
Member

MichaelChirico commented Jan 10, 2018

Quoting from Wikipedia:

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "non character" that should never appear in the text.

The attached file is indeed the referenced U+FFFE as a BOM... I'm not familiar enough with the details of encoding to know how to handle this case. Certainly for this file it appears we can just treat U+FFFE as the BOM, axe it, and read the rest of the file...
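To make the byte order concrete: the same two leading bytes give different code units depending on how they are combined. A small sketch (the helper `code_unit` is made up for illustration) suggests that FF FE read as little-endian is the ordinary BOM U+FEFF, and that only a big-endian reading yields U+FFFE:

```r
# Combine two bytes into one 16-bit code unit under a given byte order.
# `code_unit` is an illustrative helper, not part of R or data.table.
code_unit <- function(bytes, little_endian) {
  b <- as.integer(bytes)
  if (little_endian) b[1] + 256L * b[2] else 256L * b[1] + b[2]
}

bom_bytes <- c(0xFF, 0xFE)  # the two leading bytes, shown as 'ÿþ' in the error

as_le <- sprintf("U+%04X", code_unit(bom_bytes, little_endian = TRUE))
as_be <- sprintf("U+%04X", code_unit(bom_bytes, little_endian = FALSE))

as_le  # "U+FEFF": the BOM, so the content is little-endian
as_be  # "U+FFFE": the non-character produced by the wrong byte order
```

Under that reading, dropping the two bytes and decoding the rest as little-endian UTF-16 is consistent with what the editors mentioned earlier in the thread appear to do.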

Adding the verbose output since it offers some further insight:

fread('test_file.txt', verbose = TRUE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
#   Using 8 threads (omp_get_max_threads()=8, nth=8)
#   NAstrings = [<<NA>>]
#   None of the NAstrings look like numbers.
#   show progress = 1
#   0/1 column will be read as boolean
# [02] Opening the file
#   Opening file /Users/michael.chirico/Downloads/test_file.txt
#   File opened, size = 1000 bytes.
#   Memory mapped ok
# [03] Detect and skip BOM

then the error; the fread.c code specifically looks for the FFFE/FEFF marker before erroring:

else if (fileSize >= 2 && sof[0] + sof[1] == '\xFE' + '\xFF') {  // either 0xFE 0xFF or 0xFF 0xFE
    STOP("File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.");
  }
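As a side note on that condition: it compares the sum of two bytes rather than the bytes themselves. The sketch below emulates the arithmetic in R, assuming that `sof` points to plain `char` and that `char` is signed on the platform (both are assumptions, not something confirmed in this thread). Under those assumptions other byte pairs can satisfy the sum as well; with an unsigned `char` only the two BOM orders would. All helper names are made up for illustration.

```r
# Value of a byte (0..255) when read through a signed 8-bit char.
as_signed_char <- function(b) ifelse(b > 127, b - 256, b)

# Emulates the sum comparison from the excerpt under the signed-char assumption.
sum_check <- function(b0, b1) {
  as_signed_char(b0) + as_signed_char(b1) ==
    as_signed_char(0xFE) + as_signed_char(0xFF)
}

# Stricter alternative: compare each byte explicitly.
is_utf16_bom <- function(b0, b1) {
  (b0 == 0xFF && b1 == 0xFE) || (b0 == 0xFE && b1 == 0xFF)
}

sum_check(0xFF, 0xFE)     # TRUE: little-endian BOM
sum_check(0xFE, 0xFF)     # TRUE: big-endian BOM
sum_check(0xFD, 0x00)     # TRUE as well: -3 + 0 equals -2 + -1
is_utf16_bom(0xFD, 0x00)  # FALSE
```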

@st-pasha
Contributor

@MichaelChirico UTF-16 was created in the days when it was believed that 65536 Unicode characters would be enough for everybody. Since this is no longer true, UTF-16 uses either 2 or 4 bytes to store every Unicode character. This makes it super inconvenient: even a simple string like Hello, world! takes 26 bytes in this encoding. The encoded string will look like this: H \0 e \0 l \0 l \0 o \0 , \0 (space) \0 w \0 o \0 r \0 l \0 d \0 ! \0 (where each \0 is a NUL byte). This is why you can't pass this file to fread: it does not tolerate NUL bytes...
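The byte layout described above can be checked from R itself. This is only a quick sketch using base R's `iconv()` with `toRaw = TRUE`; the variable names are arbitrary, and the exact set of encodings available to `iconv()` depends on the platform.

```r
# Encode an ASCII string as UTF-16LE and inspect the raw bytes.
utf16 <- iconv("Hello, world!", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

length(utf16)                               # 26 bytes for 13 characters
nul_positions <- which(utf16 == as.raw(0))
nul_positions                               # every second byte: 2, 4, ..., 26

# Base R strings cannot hold NUL either, so a direct conversion fails.
direct <- tryCatch(rawToChar(utf16), error = function(e) "error")

# A character outside the Basic Multilingual Plane needs two 16-bit units.
emoji <- iconv("\U0001F600", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
length(emoji)                               # 4 bytes
```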

Writing parsers specifically for UTF-16 encoding is way too much trouble -- an easier approach is to first recode the input into UTF-8, and then use fread. I believe the R function iconv() can do that (and in Python we use PyUnicode_DecodeUTF16()).
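A minimal sketch of that recoding step in base R, assuming the file is little-endian UTF-16 with an FF FE byte order mark and small enough to read into memory in one go. The helper name `utf16le_file_to_utf8` is made up for illustration, and the sample data is invented.

```r
# Decode a UTF-16LE file into a single UTF-8 string using base R only.
# Illustrative helper: assumes little-endian input, reads the whole file.
utf16le_file_to_utf8 <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # Drop the FF FE byte order mark if present so it does not end up in the text.
  if (length(bytes) >= 2L && identical(bytes[1:2], as.raw(c(0xFF, 0xFE)))) {
    bytes <- bytes[-(1:2)]
  }
  # iconv() accepts a list of raw vectors and returns a character vector.
  iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
}

# Self-contained demonstration on a tiny tab-separated sample.
sample_path <- tempfile(fileext = ".txt")
sample_bytes <- iconv("year\tvalue\r\n2018\t1\r\n", from = "UTF-8",
                      to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xFF, 0xFE)), sample_bytes), sample_path)

utf8_text <- utf16le_file_to_utf8(sample_path)
```

Because the decoded string contains newlines, it could then be passed straight to the parser, for example `data.table::fread(utf8_text, sep = "\t")`. Going through raw bytes rather than a character string sidesteps the embedded-NUL restriction on R strings.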

@MichaelChirico
Member

@st-pasha so should we won't-fix this issue and add this note to the documentation?

@st-pasha
Contributor

On the contrary, since the solution is relatively easy (just slow), we should do it ourselves. It's perfectly fine to be slow in rare cases.
So in the code snippet that you posted above, we should replace STOP(...) with a call to the R function iconv and then replace the original input buffer with the decoded one. There may even be an R C-API function to do the same.
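The proposed flow can be sketched at the R level as follows. This is not how fread is implemented; it only illustrates the idea of decoding where the current code stops with an error, then carrying on with the UTF-8 bytes. The function name `decode_if_utf16` is hypothetical.

```r
# If the buffer starts with a UTF-16 byte order mark, return its UTF-8
# equivalent; otherwise return the buffer unchanged. Hypothetical helper.
decode_if_utf16 <- function(buffer) {
  stopifnot(is.raw(buffer))
  bom <- if (length(buffer) >= 2L) buffer[1:2] else raw(0)
  from <- NULL
  if (identical(bom, as.raw(c(0xFF, 0xFE)))) from <- "UTF-16LE"
  if (identical(bom, as.raw(c(0xFE, 0xFF)))) from <- "UTF-16BE"
  if (is.null(from)) return(buffer)  # not UTF-16: leave untouched
  decoded <- iconv(list(buffer[-(1:2)]), from = from, to = "UTF-8",
                   toRaw = TRUE)[[1]]
  if (is.null(decoded)) stop("input has a UTF-16 BOM but could not be decoded")
  decoded  # UTF-8 bytes that take the place of the original buffer
}

# Three inputs holding the same text: UTF-16LE, UTF-16BE, and plain UTF-8.
plain <- charToRaw("a\tb\n")
le <- c(as.raw(c(0xFF, 0xFE)),
        iconv("a\tb\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]])
be <- c(as.raw(c(0xFE, 0xFF)),
        iconv("a\tb\n", from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]])

from_le <- decode_if_utf16(le)
from_be <- decode_if_utf16(be)
from_plain <- decode_if_utf16(plain)
```

The extra pass costs time and memory proportional to the file size, which matches the "just slow" trade-off described above, while inputs without a UTF-16 BOM take the unchanged path.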

@jangorecki added the encoding label on Jun 27, 2020