Support UTF-16 encoded files in fread #2560

danielsjf · 2018-01-09T16:01:11Z

I am trying to open a file that contains '00' (NUL) bytes; more specifically, every other byte is NUL. There are also two bytes at the beginning that seem to only specify the encoding. I haven't been able to read this file with fread. See also the full example (+reprex) in this Stack Overflow post:
https://stackoverflow.com/questions/48169100/reading-a-tsv-with-specific-encoding-initial-two-bytes-and-utf-8-afterwards-an

When I googled the issue I found similar problems:
https://q-a-assistant.info/computer-internet-technology/r-data-table-error-in-fread-embedded-nul-in-string-0-0-0-000/264557
https://stackoverflow.com/questions/31701365/error-with-fread-in-r-embedded-nul-in-string-0
https://stackoverflow.com/questions/22643372/embedded-nul-in-string-error-when-importing-csv-with-fread?lq=1

But the workarounds are not sufficient:

The first one is just opening the file in Excel and saving it again. Since the data I use is downloaded automatically from another source (without user interaction), I don't have this option. Even worse, even if I wanted to do it that way, the file is longer than Excel's limit of about 1M rows, so I couldn't use this for the entire file.
The subsequent ones use a Linux command, and when I try to use it inside the fread command, it doesn't work (the person from the first post has the same issue).

Would it be possible to skip the NUL values, just as the base functions do? See readLines (skipNul) or read.table (skipNul).
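For comparison, base R can usually read such a file if the encoding is passed explicitly, which decodes the text rather than skipping NULs. A minimal sketch, assuming the file is tab-separated and UTF-16LE encoded; read_tsv_utf16le is a hypothetical helper name, not an existing function:

```r
# Hypothetical helper (not part of base R or data.table).
# Assumes a tab-separated file encoded as UTF-16LE, with or without a BOM.
read_tsv_utf16le <- function(path, ...) {
  # fileEncoding makes read.delim() open a re-encoding connection, so the
  # text is decoded properly instead of NUL bytes simply being dropped.
  utils::read.delim(path, fileEncoding = "UTF-16LE",
                    stringsAsFactors = FALSE, ...)
}
```

This is typically much slower than fread on large files, so it is a stopgap rather than a replacement.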

This is how the file shows up in a hex editor:

First 100 bytes of the file: test_file.txt
It's actually a TSV file, but GitHub doesn't allow that format.

# Reprex

file <- 'test_file.txt'

# fread from data.table is not able to read the file
tmp <- data.table::fread(file, nrows = 2)
#> Error in data.table::fread(file, nrows = 2): embedded nul in string: 'ÿþy\0e\0a\0r\0'

# It also doesn't work with sed, potentially since I'm on Windows
tmp <- data.table::fread(paste0("sed 's/\\0//g' '", file, "'"), nrows = 2)
#> Error in data.table::fread(paste0("sed 's/\\0//g' '", file, "'"), nrows = 2): embedded nul in string: 'ÿþy\0e\0a\0r\0'

# Output of sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5 tools_3.4.1 yaml_2.1.15 stringi_1.1.6 data.table_1.10.4-3
[7] stringr_1.2.0


MichaelChirico · 2018-01-09T16:05:46Z

Please install the current development version as this particular issue has seen some progress lately.

I get a different error on the file you provided:

fread('~/Downloads/test_file.txt')
# Error in fread("~/Downloads/test_file.txt") : 
#   File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.

danielsjf · 2018-01-09T16:21:03Z

With the latest development version, I get an error that seems unrelated:

tmp <- data.table::fread(file, nrows = 2)
#> Error in data.table::fread(file, nrows = 2) : object 'CfreadR' not found

So I would trust your version more. It seems logical that it is indeed some kind of a UTF-16 version. Will there be an option in the future to read UTF-16 files directly?

MichaelChirico · 2018-01-09T16:24:38Z

You're having an update issue -- try uninstalling data.table completely first...

From the Installation wiki:

On Windows, when upgrading any package that uses compiled code, it appears to be important to close all R sessions before upgrading. This releases all locks that Windows holds on dlls. To be really sure, reboot too. Then open a new fresh R session.

MichaelChirico · 2018-01-09T16:27:11Z

The problem with the UTF-16 theory is that Atom opens the file as UTF-8 and points out the leading NUL characters:

So I guess this is in fact an outstanding bug.

MichaelChirico · 2018-01-09T16:29:37Z

Tagging related issues: #2247, #2496, #2435

danielsjf · 2018-01-09T16:37:52Z

Re-installed data.table and now I also see the UTF-16 error.

danielsjf · 2018-01-09T16:46:01Z

Both Sublime Text 3 and Notepad++ also read it correctly. Sublime seems to ignore the first two bytes and subsequently skips the NUL bytes. Notepad++ does the same, but marks the encoding as "UCS-2 LE BOM". After some Googling I found that UCS-2 was a predecessor of UTF-16. The two standards are almost exactly the same, which often causes confusion.

MichaelChirico · 2018-01-09T16:52:41Z

Following the approach here it seems Chrome reads the page as "UTF-16LE".

HughParsonage · 2018-01-10T03:53:37Z

For completeness, the file command run from Git Bash says it is "Little-endian UTF-16 Unicode text, with CRLF line terminators", which is consistent with the fread error message. But I guess the point is moot: without a header, you can only infer the encoding.

MichaelChirico · 2018-01-10T16:40:45Z

Quoting from Wikipedia:

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "non character" that should never appear in the text.

The attached file is indeed the referenced U+FFFE as a BOM... I'm not familiar enough with the details of encoding to know how to handle this case. Certainly for this file it appears we can just treat U+FFFE as the BOM, axe it, and read the rest of the file...
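As far as the byte order goes, the two leading bytes FF FE are what U+FEFF looks like when serialized little-endian; U+FFFE only comes out if the same bytes are read as big-endian. A minimal sketch of sniffing the encoding from those leading bytes, where bom_encoding is a hypothetical helper name and UTF-32 BOMs are deliberately not handled:

```r
# Hypothetical helper: guess a file's encoding from a leading byte order mark.
# UTF-32 is not handled; a UTF-32LE file (FF FE 00 00) would be reported
# as "UTF-16LE" by this sketch.
bom_encoding <- function(path) {
  b <- as.integer(readBin(path, what = "raw", n = 3L))
  if (length(b) >= 3L && all(b[1:3] == c(0xEF, 0xBB, 0xBF))) return("UTF-8")
  # FF FE: U+FEFF stored least-significant byte first (little-endian).
  if (length(b) >= 2L && all(b[1:2] == c(0xFF, 0xFE))) return("UTF-16LE")
  # FE FF: U+FEFF stored most-significant byte first (big-endian).
  if (length(b) >= 2L && all(b[1:2] == c(0xFE, 0xFF))) return("UTF-16BE")
  NA_character_  # no recognised BOM
}
```

Without a BOM the function returns NA, and the encoding can only be inferred from the content.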

Adding the verbose output since it offers some further insight:

fread('test_file.txt', verbose = TRUE)
# Input contains no \n. Taking this to be a filename to open
# [01] Check arguments
#   Using 8 threads (omp_get_max_threads()=8, nth=8)
#   NAstrings = [<<NA>>]
#   None of the NAstrings look like numbers.
#   show progress = 1
#   0/1 column will be read as boolean
# [02] Opening the file
#   Opening file /Users/michael.chirico/Downloads/test_file.txt
#   File opened, size = 1000 bytes.
#   Memory mapped ok
# [03] Detect and skip BOM

then the error; the fread.c code specifically looks for the FFFE/FEFF marker before erroring:

else if (fileSize >= 2 && sof[0] + sof[1] == '\xFE' + '\xFF') {  // either 0xFE 0xFF or 0xFF 0xFE
  STOP("File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.");
}

MichaelChirico · 2018-10-19T05:24:00Z

to be updated

https://stackoverflow.com/questions/36862340/fast-method-to-read-csv-with-utf-16le-encoding

st-pasha · 2018-10-21T23:16:37Z

@MichaelChirico UTF-16 was created in the days when it was believed that 65536 Unicode characters would be enough for everybody. Since this is no longer true, UTF-16 uses either 2 or 4 bytes to store every Unicode character. This makes it super inconvenient: even a simple string like "Hello, world!" takes 26 bytes in this encoding. The encoded string looks like this: H \0 e \0 l \0 l \0 o \0 , \0 (space) \0 w \0 o \0 r \0 l \0 d \0 ! \0 (where each \0 is a NUL byte). This is why you can't pass this file to fread: it does not tolerate NUL bytes...
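The byte layout described above can be checked from R with iconv() and toRaw = TRUE. This is only an illustration of the encoding and uses nothing beyond base R:

```r
# Encode the example string as little-endian UTF-16 and inspect the raw bytes.
bytes <- iconv("Hello, world!", from = "UTF-8", to = "UTF-16LE",
               toRaw = TRUE)[[1L]]

n_bytes <- length(bytes)                                   # 26: two bytes per character
nul_at_even <- all(bytes[c(FALSE, TRUE)] == as.raw(0x00))  # TRUE: every second byte is NUL
ascii_part <- rawToChar(bytes[c(TRUE, FALSE)])             # "Hello, world!"
```

Dropping the NUL bytes recovers the text only because every character here is ASCII; for other characters the second byte of each pair is not NUL.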

Writing parsers specifically for UTF-16 encoding is way too much trouble -- an easier approach is to first recode the input into UTF-8, and then use fread. I believe iconv() R function can do that (and in python we use PyUnicode_DecodeUTF16()).
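The recode-then-read approach could look roughly like this in R. It is a sketch under stated assumptions: fread_utf16le is a hypothetical helper name (not a data.table function), the input is assumed to be little-endian UTF-16, and the whole file is assumed to fit in memory.

```r
library(data.table)

# Hypothetical helper (not part of data.table): recode a UTF-16LE file into a
# temporary UTF-8 file, then hand that file to fread().
fread_utf16le <- function(path, ...) {
  bytes <- readBin(path, what = "raw", n = file.size(path))

  # Drop a leading little-endian BOM (FF FE) so it does not end up
  # inside the first column name.
  if (length(bytes) >= 2L && all(bytes[1:2] == as.raw(c(0xFF, 0xFE)))) {
    bytes <- bytes[-(1:2)]
  }

  # iconv() accepts a list of raw vectors; toRaw = TRUE keeps the result as
  # bytes, so nothing is translated through the session's native encoding.
  utf8 <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8",
                toRaw = TRUE)[[1L]]
  if (is.null(utf8)) stop("iconv() could not convert '", path, "' from UTF-16LE")

  tmp <- tempfile(fileext = ".tsv")
  on.exit(unlink(tmp))
  writeBin(utf8, tmp)
  fread(tmp, encoding = "UTF-8", ...)
}
```

Usage would be along the lines of fread_utf16le("test_file.txt", nrows = 2). Going through a temporary file costs an extra pass over the data, which matches the "easy, just slow" trade-off.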

MichaelChirico · 2018-10-22T02:59:46Z

@st-pasha so should we won't-fix this issue and add this note to the documentation?

st-pasha · 2018-10-22T20:39:20Z

On the contrary, since the solution is relatively easy (just slow), we should do it ourselves. It's perfectly fine to be slow in rare cases.
So in that code snippet that you posted above, we should replace STOP(...) with a call to R function iconv and then replace the original input buffer with the decoded one. There may even be an R C-API function to do the same.

MichaelChirico added the fread label Jan 9, 2018

st-pasha added the feature request label Jan 9, 2018

st-pasha changed the title ~~embedded nul in string~~ Support UTF-16 encoded files in fread Jan 9, 2018

st-pasha mentioned this issue Jan 9, 2018

Master task for fread bugs / proposals #2247

Closed

MichaelChirico mentioned this issue Jan 11, 2018

UTF-16 file incorrectly tagged as UTF-8 atom/atom#16539

Closed

jangorecki added the encoding issues related to Encoding label Jun 27, 2020
