Support UTF-16 encoded files in fread #2560
Please install the current development version as this particular issue has seen some progress lately. I get a different error on the file you provided:
With the latest development version, I get an error that seems unrelated:
So I would trust your version more. It seems logical that it is indeed some kind of UTF-16 file. Will there be an option in the future to read UTF-16 files directly?
You're having an update issue -- try uninstalling and re-installing data.table. From the Installation wiki:
The problem with the file persists, so I guess this is in fact an outstanding bug.
Re-installed data.table and now I also see the UTF-16 error. |
Both Sublime Text 3 and Notepad++ also read it correctly. Sublime seems to ignore the first two bytes and subsequently skips the NUL bytes. Notepad++ does the same, but the encoding is marked as "UCS-2 LE BOM". After some Googling I've found that UCS-2 was a predecessor of UTF-16. The standards are almost exactly the same, which can often cause some confusion.
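The editor behaviour described above comes down to sniffing the byte-order mark: a file starting with FF FE is UTF-16 LE, which is the same signature Notepad++ labels "UCS-2 LE BOM". A minimal sketch of that detection, in Python for illustration (the helper name `sniff_bom` and the fallback to UTF-8 are my assumptions, not anything fread currently does):

```python
# BOM-based encoding detection, roughly what editors like Notepad++ do.
# Returns a best-guess codec name for the file at `path`.
import codecs

def sniff_bom(path):
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):          # EF BB BF
        return "utf-8-sig"
    if head.startswith(codecs.BOM_UTF32_LE):      # FF FE 00 00 -- must check before UTF-16 LE
        return "utf-32-le"
    if head.startswith(codecs.BOM_UTF32_BE):      # 00 00 FE FF
        return "utf-32-be"
    if head.startswith(codecs.BOM_UTF16_LE):      # FF FE
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):      # FE FF
        return "utf-16-be"
    return "utf-8"  # no BOM: assume plain UTF-8 (an assumption, not a guarantee)
```

Note that the UTF-32 LE BOM begins with the same two bytes as UTF-16 LE, so it has to be tested first.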
Following the approach here, it seems Chrome also detects the encoding from those first two bytes.
For completeness,
Quoting from Wikipedia: UCS-2 is an earlier fixed-width encoding that UTF-16 extends; within the Basic Multilingual Plane the two are identical.
The attached file is indeed the referenced UCS-2 LE BOM format; reading it reproduces the error.
@MichaelChirico UTF-16 was created in the days when it was believed that 65536 Unicode characters would be enough for everybody. Since this is no longer true, UTF-16 uses either 2 or 4 bytes to store every Unicode character, which makes it super inconvenient to work with. Writing parsers specifically for UTF-16 encoding is way too much trouble -- an easier approach is to first recode the input into UTF-8, and then use fread.
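The recode-then-parse approach suggested here can be sketched as follows (Python for illustration; the function name and the use of a temporary file are assumptions, not data.table's actual mechanism):

```python
# Sketch of "recode first, then parse": convert a UTF-16 file (with BOM)
# to a UTF-8 temp file that a byte-oriented parser like fread can consume.
import tempfile

def recode_utf16_to_utf8(src_path):
    # Python's "utf-16" codec consumes the BOM and picks the byte order from it.
    with open(src_path, "r", encoding="utf-16") as src:
        text = src.read()
    dst = tempfile.NamedTemporaryFile(mode="w", encoding="utf-8",
                                      suffix=".tsv", delete=False)
    with dst:
        dst.write(text)
    return dst.name  # path to the recoded UTF-8 copy
```

The one-time transcoding pass is linear in file size, which is why it is "relatively easy (just slow)" compared with teaching the parser itself about UTF-16.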
@st-pasha so should we leave the conversion to the user?
On the contrary, since the solution is relatively easy (just slow), we should do it ourselves. It's perfectly fine to be slow in rare cases. |
I am trying to open a file with '00' bytes in it -- more specifically, every other byte is '00'. There are also two bytes at the beginning that seem to specify only the encoding. I haven't been able to read this file with fread. See also the full example (+reprex) in this Stack Overflow post:
https://stackoverflow.com/questions/48169100/reading-a-tsv-with-specific-encoding-initial-two-bytes-and-utf-8-afterwards-an
When I googled the issue I found similar problems:
https://q-a-assistant.info/computer-internet-technology/r-data-table-error-in-fread-embedded-nul-in-string-0-0-0-000/264557
https://stackoverflow.com/questions/31701365/error-with-fread-in-r-embedded-nul-in-string-0
https://stackoverflow.com/questions/22643372/embedded-nul-in-string-error-when-importing-csv-with-fread?lq=1
But the work-around is not sufficient:
Would it be possible to skip the NUL values, just as the base functions do? See readLines (skipNul) or read.table (skipNul).
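For comparison, the effect of base R's skipNul = TRUE can be approximated by stripping the NUL bytes (and the two-byte BOM) before parsing. A sketch, safe only because this file's payload is ASCII-range UTF-16, so dropping the high bytes loses nothing:

```python
# Rough equivalent of skipNul = TRUE: drop a UTF-16 LE BOM, if present,
# then remove embedded NUL bytes so a byte-oriented parser can cope.
# Only lossless when every character is in the ASCII range.
def skip_nul(raw: bytes) -> bytes:
    if raw.startswith(b"\xff\xfe"):   # UTF-16 LE BOM
        raw = raw[2:]
    return raw.replace(b"\x00", b"")
```

For genuinely non-ASCII UTF-16 data this would corrupt the text, which is why proper transcoding (as discussed above) is the more robust fix.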
This is how the file shows up in a hex editor:
First 100 bytes of the file: test_file.txt
It's actually a TSV file, but GitHub doesn't allow that format.
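The pattern in the hex view -- two signature bytes, then a 00 after every character -- is exactly what ASCII text looks like in UTF-16 LE, where each character occupies two bytes and the high byte of any ASCII character is 0x00. A quick illustration (Python):

```python
# ASCII TSV content as UTF-16 LE bytes: BOM first, then char/NUL pairs.
import codecs

data = codecs.BOM_UTF16_LE + "id\tvalue\n".encode("utf-16-le")
print(data.hex(" "))
# ff fe is the BOM; 69 00 64 00 09 00 ... are 'i', 'd', tab, each padded with 00
```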
# Reprex
# Output of sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5 tools_3.4.1 yaml_2.1.15 stringi_1.1.6 data.table_1.10.4-3
[7] stringr_1.2.0