HTML/CSV export corrupts UTF-8 characters outside of the Basic Multilingual Plane (BMP), i.e. code points U+10000 and above #1197

Closed
Yuutakasan opened this issue Jun 6, 2017 · 27 comments · Fixed by #2722
Labels: CSV/TSV · encoding · export · import · Priority: High · Type: Bug

Comments

@Yuutakasan

Yuutakasan commented Jun 6, 2017

OpenRefine 2.7 rc2

After importing a UTF-8 file and exporting it as a UTF-8 file, some characters are garbled.

Displayed characters
image
有限会社なべ茶屋あさ𡌛

Exported garbled characters
image
有限会社なべ茶屋あさ����

Other characters that are garbled on export
𣘺𣳾

@Yuutakasan changed the title from "Using export garbled characters." to "After reading UTF 8 file and executing export as UTF 8 file, garbled characters occurred." Jun 6, 2017
@thadguidry
Member

thadguidry commented Jun 6, 2017

@Yuutakasan Hmm, that's weird. What is interesting is that the export shows 4 replacement characters (the 4 question marks) for just 1 character. The byte dump I get for that last character is 6 bytes long, but the first and last bytes are line feeds (\x0A); the character itself takes 4 bytes, which is the most that UTF-8 uses for a single code point.

𡌛 = \x0A\xF0\xA1\x8C\x9B\x0A

Further interesting is that when I copy and paste your last character into a single OpenRefine cell, I actually get a different character...

ጛ = \xE1\x8C\x9B

instead of

𡌛 = \xF0\xA1\x8C\x9B

@jackyq2015 Can you debug this ?
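
A side note on the two byte sequences above: the character that ended up in the cell, ጛ (U+131B), has exactly the low 16 bits of 𡌛 (U+2131B). The sketch below, written in Python rather than OpenRefine's Java, only checks that arithmetic. The helper name `truncate_to_16_bits` is made up for illustration, and the sketch does not show where, or whether, such a truncation happens in OpenRefine or in the browser.

```python
def truncate_to_16_bits(ch: str) -> str:
    """Keep only the low 16 bits of a character's code point (hypothetical helper)."""
    return chr(ord(ch) & 0xFFFF)


original = "\U0002131B"  # the character from the report, UTF-8 bytes F0 A1 8C 9B
truncated = truncate_to_16_bits(original)

# Print each code point next to its UTF-8 bytes for comparison.
for ch in (original, truncated):
    print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" "))
```

If the pasted value really was cut down to 16 bits somewhere, the displayed character and its 3-byte encoding would follow directly from the original code point.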

@Yuutakasan
Author

Yuutakasan commented Jun 6, 2017

I will attach a sample file for reference.

import file
import.txt
export file
export.txt

These characters are actually used in the names of corporations registered in Japan.

有限会社なべ茶屋あさ𡌛
株式会社石𣘺組
有限会社𣳾新商事
𣳾幸1合同会社

@thadguidry
Member

thadguidry commented Jun 6, 2017

@Yuutakasan When I export your import.txt file... I get
capture

有限会社なべ茶屋あさ𡌛

You are probably not using a viewer such as Notepad++ that can show the bytes of that last character, which are \xED\xA1\x84\xED\xBC\x9B?

But regardless, it's a bug somewhere, because somehow during export we change the bytes...

from
\xF0\xA1\x8C\x9B

to
\xED\xA1\x84\xED\xBC\x9B
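
The changed bytes are consistent with the character's two UTF-16 surrogate code units each being written as a separate 3-byte sequence (the scheme known as CESU-8), instead of the whole code point being written as one 4-byte UTF-8 sequence. The Python sketch below reproduces both byte strings. `encode_surrogates_separately` is a hypothetical helper, and this is only a guess at the mechanism, not a trace of OpenRefine's export code.

```python
def encode_surrogates_separately(text: str) -> bytes:
    """Encode every UTF-16 code unit as its own UTF-8 sequence (CESU-8 style)."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U0002131B"  # the character from the report
print(ch.encode("utf-8").hex(" "))                  # regular UTF-8
print(encode_surrogates_separately(ch).hex(" "))    # one sequence per surrogate
```

Characters inside the BMP occupy a single UTF-16 code unit, so both encodings agree for them, which would fit the observation that only some characters are affected.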

@Yuutakasan
Author

I currently use EmEditor; I will try Notepad++.
EmEditor
https://www.emeditor.com/

@Yuutakasan
Author

I was able to reproduce the same phenomenon.
image

@thadguidry added the Type: Bug label Jun 6, 2017
@Yuutakasan
Author

Yuutakasan commented Jun 6, 2017

Thank you, @thadguidry.
I tested with several different characters.
It seems that any character with a code point of U+10000 or higher is garbled.

import txt
import-test-sample.txt
image

export txt
export-test-sample-txt.txt
image
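
The observation above, that only code points of U+10000 or higher are affected, can be checked against any sample string. The following is a minimal Python sketch; `non_bmp_characters` is a hypothetical helper name, and the sample is one of the company names from this thread with its last character written as an escape.

```python
def non_bmp_characters(text: str) -> list:
    """Return the characters whose code point lies above U+FFFF (outside the BMP)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


sample = "有限会社なべ茶屋あさ\U0002131B"
for ch in non_bmp_characters(sample):
    # Characters outside the BMP need two UTF-16 code units (a surrogate pair).
    units = len(ch.encode("utf-16-be")) // 2
    print(f"U+{ord(ch):04X} needs {units} UTF-16 code units")
```

Running a check like this over an import file lists the characters that are at risk before any export is attempted.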

@thadguidry
Copy link
Member

@Yuutakasan Thanks, we'll have to let @jackyq2015 look into this specifically. My hunch is that we might not actually be storing the value correctly in the cell, so this https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/exporters/CsvExporter.java#L108 might be giving back the wrong data in the first place. Otherwise it's an issue in csvwriter itself, here: https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/exporters/CsvExporter.java#L114

@jackyq2015
Contributor

Can you try adding -Dfile.encoding=UTF-8 to the Java command options to enforce the encoding?

@Yuutakasan
Author

OK. I will try it!

@Yuutakasan
Author

@jackyq2015
Can I test this by adding the "-Dfile.encoding=UTF-8" setting to the openrefine.l4j.ini file?

@thadguidry
Member

@Yuutakasan Yes, but you can also test it by adding it to the refine.ini file and starting refine.bat, or refine.sh if you're on Linux. Just uncomment the JAVA_OPTIONS= line.

@Yuutakasan
Author

Since nothing changed, I was worried about whether the setting really took effect.
Once I have run it, I will share the result.

@jackyq2015
Contributor

@Yuutakasan Given your description, your file is not being decoded properly as UTF-8; that's why I asked you to enforce it. Please note that the system cannot detect the encoding of an arbitrary stream with 100% accuracy. Libraries such as icu4j can help improve the accuracy, and there is a PR (not merged yet) that introduces it. If you want to get your hands dirty, you can create your own branch, merge the PR into it, and give it a try.

@wetneb added the encoding, import, and export labels Aug 2, 2017
@wetneb added the CSV/TSV label and removed the import label Sep 18, 2017
@jackyq2015 added the Priority: High label Oct 25, 2017
@Yuutakasan
Author

@jackyq2015 @thadguidry
Sorry for the late reply. I tried the settings you suggested the other day.

  1. I downloaded OpenRefine 2.8.
  2. I changed the configuration to the following settings.
    openrefine.l4j.zip
  3. I imported the data that was garbled before.
    import-test-sample.zip
    image

    It is displayed correctly.

    image

    image

  4. The exported file is garbled.
    export-test-sample-txt.zip
    image

@Yuutakasan
Author

I think this is a string conversion error at export time, not an encoding detection bug at import time.

@Yuutakasan
Author

Other export formats
Excel
import-test-sample-xlsx.zip
image

HTML
image
import-test-sample-html.zip

@jackyq2015
Contributor

Can you please add the encoding switch I provided above and try again?

@Yuutakasan
Author

Yuutakasan commented Nov 22, 2017

@jackyq2015
The steps above were run with the following settings.

configfile
https://github.com/OpenRefine/OpenRefine/files/1492661/openrefine.l4j.zip

openrefine.l4j.ini
***********************
-Xms256M
-Xmx1024M
-Djava.net.useSystemProxies=true
-Dfile.encoding="UTF-8"

@wetneb added the import label Dec 22, 2019
@tfmorris self-assigned this Jun 11, 2020
@tfmorris added this to the 3.4 milestone Jun 11, 2020
@tfmorris
Member

Hopefully this has been fixed, but we should confirm for the 3.4 release.

@tfmorris
Member

From my limited testing, it looks like XLSX export is OK (at least for Numbers on my Mac), CSV is totally corrupted, and HTML broken for the higher code points as shown above.

@tfmorris
Member

@Yuutakasan Sorry for the long delay. The fix for this should make it into 3.4.

@tfmorris changed the title from "After reading UTF 8 file and executing export as UTF 8 file, garbled characters occurred." to "Export corrupts UTF-8 characters outside of Basic Multilingual Pane (BMP) ie code point >10000" Jun 13, 2020
@tfmorris changed the title from "Export corrupts UTF-8 characters outside of Basic Multilingual Pane (BMP) ie code point >10000" to "HTML/CSV export corrupts UTF-8 characters outside of Basic Multilingual Pane (BMP) ie code point >10000" Jun 13, 2020
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Jun 14, 2020
Fixes OpenRefine#1197. Previously we were using a funky ContentType to attempt
to force a file download rather than display in browser, but this
conflicted with attempts to save UTF-8 which was outside the Basic
Multilingual Plane (BMP).

By switching to ContentDisposition: attachment, which has been
the preferred method for a number of years, we can avoid this conflict.

As part of this, switch to using the "preview" param consistently
to control preview vs download rather than the content type.
@Yuutakasan
Author

Thank you. I'll test.

wetneb pushed a commit that referenced this issue Jun 16, 2020

* Use ContentDisposition instead of ContentType to control download

Fixes #1197. Previously we were using a funky ContentType to attempt
to force a file download rather than display in browser, but this
conflicted with attempts to save UTF-8 which was outside the Basic
Multilingual Plane (BMP).

By switching to ContentDisposition: attachment, which has been
the preferred method for a number of years, we can avoid this conflict.

As part of this, switch to using the "preview" param consistently
to control preview vs download rather than the content type.

* Switch content type to text/plain

Now that we don't need to use ContentType to control download
behavior, we can use something more reasonable.
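
The approach described in the commit message can be sketched roughly as follows. This is an illustrative Python sketch, not OpenRefine's Java code: the function `build_export_response` and its parameters are assumptions made for the example. It shows the content type staying an ordinary `text/plain` with an explicit UTF-8 charset, while a `preview` flag alone decides whether a `Content-Disposition: attachment` header asks the browser to download the file.

```python
def build_export_response(text: str, filename: str, preview: bool):
    """Return (headers, body) for an export request (hypothetical helper)."""
    # The content type describes the data only; it no longer doubles as a
    # trick to force a download.
    headers = {"Content-Type": "text/plain; charset=UTF-8"}
    if not preview:
        # Ask the browser to save the response instead of displaying it.
        headers["Content-Disposition"] = f'attachment; filename="{filename}"'
    # Encode whole code points as UTF-8, including those outside the BMP.
    body = text.encode("utf-8")
    return headers, body


headers, body = build_export_response("あさ\U0002131B", "export.csv", preview=False)
print(headers)
print(body.hex(" "))
```

Keeping the download decision in a separate header means the declared charset always matches the bytes that are actually written.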
@Yuutakasan
Author

Yuutakasan commented Jun 18, 2020

@tfmorris
I've tested it, and the text still comes out garbled.
@wetneb Could you please reopen this issue?

Using OpenRefine 3.4 beta

1. Import import-test-sample.txt (no problems, same as before)
import-test-sample.txt
image
2. Configure parsing options (no problems, same as before)
image
3. Create project (no problems, same as before)
image
4. Export (garbled characters in some formats)
Previously I tested only HTML and CSV, but this time I also tested the other formats.
export openrefine 3.4.zip

① Export TSV (garbled)
image

② Export CSV (garbled)
image

③ Export HTML (garbled)
Viewed in a text editor
image
Viewed in Chrome
image

④ Export Excel (not garbled)
image

⑤ Export Excel 2007+ (not garbled)
image

⑥ Export ODF spreadsheet (not garbled)
image

⑦ Export SQL (not garbled)
image

⑧ Export spreadsheet (not garbled)
image
https://docs.google.com/spreadsheets/d/12zaOy_Mh9d-85Cv7pVXYV_TTc9gJJALR8gbXi-aAvuQ/edit?usp=sharing

@wetneb
Member

wetneb commented Jun 18, 2020

@Yuutakasan this has not been fixed in 3.4 beta - that version was released before this fix. For a version that we expect not to have the issue, try this one:
https://github.com/OpenRefine/OpenRefine-nightly-releases/releases/tag/3.4-beta-148-gf88c0e3

wetneb pushed a commit that referenced this issue Jun 18, 2020

@Yuutakasan
Author

@wetneb Thanks. I'll re-test.

@Yuutakasan
Author

Yuutakasan commented Jun 18, 2020

@wetneb @tfmorris
I have confirmed that the garbling has been resolved. Thank you very much.

Using openrefine-win-3.4-beta-148-gf88c0e3
https://github.com/OpenRefine/OpenRefine-nightly-releases/releases/tag/3.4-beta-148-gf88c0e3

export openrefine 3.4-beta-148-gf88c0e3.zip

export tsv
image

export csv
image

export html
image

export excel
image

export excel2007+
image

export odf
image

export spreadsheet
image
https://docs.google.com/spreadsheets/d/1TFTVPR-H-CDrGDC7yw252qc3QoBmOJVS9Uztu9VD7Eo/edit?usp=sharing

@tfmorris
Member

Excellent. Thank you very much for testing @Yuutakasan
