HTML/CSV export corrupts UTF-8 characters outside of the Basic Multilingual Plane (BMP), i.e. code points above U+FFFF #1197
Comments
@Yuutakasan Hmm, that's weird. What is interesting is that the export shows 4 question marks for just 1 character. That matches the 4 bytes the character takes in UTF-8 (4 is the maximum per character); my dump shows 6 bytes only because of the surrounding \x0A newlines: 𡌛 = \x0A\xF0\xA1\x8C\x9B\x0A. Also interesting: when I copy and paste your last character into a single OpenRefine cell, I actually get a different character, ጛ = \xE1\x8C\x9B, instead of 𡌛 = \xF0\xA1\x8C\x9B. @jackyq2015 Can you debug this?
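The two byte sequences quoted above differ in a way that is consistent with the code point being cut down to 16 bits somewhere along the paste path: U+2131B with its upper bits dropped is U+131B. The sketch below only illustrates that arithmetic; the class and method names are hypothetical, and it does not show where (or whether) OpenRefine performs such a truncation.

```java
import java.nio.charset.StandardCharsets;

class TruncationDemo {
    // Format bytes as space-separated uppercase hex, e.g. "F0 A1 8C 9B".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // UTF-8 bytes (as hex) of a single Unicode code point.
    static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        int original = 0x2131B;            // the character from the report
        int truncated = original & 0xFFFF; // keep only the low 16 bits: 0x131B
        System.out.println(utf8Hex(original));  // F0 A1 8C 9B
        System.out.println(utf8Hex(truncated)); // E1 8C 9B
    }
}
```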
I will attach a sample file for reference: import file. This character is actually used in the name of a corporation registered in Japan.
@Yuutakasan When I export your import.txt file, I get 有限会社なべ茶屋あさ𡌛. You are probably not using a viewer like Notepad++ or similar that can show that last character as being \xED\xA1\x84\xED\xBC\x9B? Regardless, it's a bug somewhere, because the bytes change somewhere during export.
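The six bytes \xED\xA1\x84\xED\xBC\x9B are what results when the two UTF-16 surrogates of the character are each encoded as their own 3-byte sequence (an encoding often called CESU-8), instead of the pair being encoded together as the 4-byte UTF-8 sequence \xF0\xA1\x8C\x9B. A minimal sketch of the difference follows; the helper names are hypothetical, and it says nothing about which component of the export path produces these bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class SurrogateBytes {
    // Format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode every UTF-16 code unit on its own, ignoring surrogate pairing.
    // For supplementary characters this yields 3 + 3 bytes instead of 4.
    static byte[] encodePerCodeUnit(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);
            } else if (c < 0x800) {
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String s = "\uD844\uDF1B"; // U+2131B written as its surrogate pair
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // F0 A1 8C 9B
        System.out.println(hex(encodePerCodeUnit(s)));               // ED A1 84 ED BC 9B
    }
}
```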
I currently use EmEditor; I will try Notepad++.
Thank you, @thadguidry. import txt export txt
@Yuutakasan Thanks, we'll have to let @jackyq2015 look into this specifically. My hunch is that we might not actually be storing it correctly in the cell, and so this https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/exporters/CsvExporter.java#L108 might be giving back the wrong data in the first place. Otherwise it's an issue in csvwriter itself here https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/exporters/CsvExporter.java#L114
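One way to tell those two hypotheses apart would be to check the string before it reaches the writer, and separately check that a plain UTF-8 writer preserves it. The sketch below is an assumption about how such a check might look; `CellValueCheck` and its methods are hypothetical names and are not part of OpenRefine or of the CSV library it uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class CellValueCheck {
    // True if the string contains no unpaired surrogates, i.e. the stored
    // value itself is well-formed UTF-16.
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    // True if writing through a UTF-8 Writer and decoding the bytes again
    // gives back the same string.
    static boolean survivesUtf8Writer(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8).equals(s);
    }

    public static void main(String[] args) {
        String value = "\uD844\uDF1B"; // U+2131B, the character from the report
        System.out.println("well-formed: " + isWellFormed(value));
        System.out.println("round-trips: " + survivesUtf8Writer(value));
    }
}
```

If the stored value is well-formed and a plain UTF-8 writer round-trips it, the corruption is more likely to come from a later stage than from the cell contents.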
Can you try adding -Dfile.encoding=UTF-8 to the java command options to enforce the encoding?
OK. I will try it!
@jackyq2015
@Yuutakasan Yes, but you can also test it by adding it to the refine.ini file and starting refine.bat, or refine.sh if you're on Linux. Just uncomment the JAVA_OPTIONS= line.
Since there is no change, I was worried whether the setting really took effect.
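A quick way to see what a JVM reports for its encoding settings is to print the `file.encoding` system property and the default charset. This is only a sketch with a hypothetical class name: it reports the settings of the JVM it runs in, so it confirms the effect of the switch only when launched with the same options as OpenRefine.

```java
import java.nio.charset.Charset;

class EncodingCheck {
    // Describe the encoding-related settings of the current JVM.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with and once without `-Dfile.encoding=UTF-8` on the java command line shows whether the option changes anything on a given machine.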
@Yuutakasan Given your description, your file is not being decoded properly as UTF-8; that's why I asked you to enforce it. Please note that the system cannot detect the encoding of an arbitrary stream with 100% accuracy. Libraries such as icu4j can help improve the accuracy. There is actually a PR (not merged yet) to introduce it. If you want to get your hands wet, you can create your own branch, merge the PR into it, and give it a try.
@jackyq2015 @thadguidry
4. The exported file is garbled.
I think this is a string conversion error at export time, not an encoding detection bug at import time.
Other export pattern
Can you please add the encoding switch I provided above and try again?
@jackyq2015 config file openrefine.l4j.ini
Hopefully this has been fixed, but we should confirm for the 3.4 release. |
From my limited testing, it looks like XLSX export is OK (at least for Numbers on my Mac), CSV is totally corrupted, and HTML broken for the higher code points as shown above. |
@Yuutakasan Sorry for the long delay. The fix for this should make it into 3.4.
Fixes OpenRefine#1197. Previously we were using a funky ContentType to attempt to force a file download rather than display in browser, but this conflicted with attempts to save UTF-8 which was outside the Basic Multilingual Plane (BMP). By switching to ContentDisposition: attachment, which has been the preferred method for a number of years, we can avoid this conflict. As part of this, switch to using the "preview" param consistently to control preview vs download rather than the content type.
Thank you. I'll test.
* Use ContentDisposition instead of ContentType to control download. Fixes #1197. Previously we were using a funky ContentType to attempt to force a file download rather than display in browser, but this conflicted with attempts to save UTF-8 which was outside the Basic Multilingual Plane (BMP). By switching to ContentDisposition: attachment, which has been the preferred method for a number of years, we can avoid this conflict. As part of this, switch to using the "preview" param consistently to control preview vs download rather than the content type.
* Switch content type to text/plain. Now that we don't need to use ContentType to control download behavior, we can use something more reasonable.
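The approach described in the commit message can be sketched roughly as follows. This is not OpenRefine's actual code: the class, the method, the plain `Map` standing in for an HTTP response, and the `charset=UTF-8` parameter are all assumptions made for illustration. The only points taken from the commit message are the text/plain content type, the Content-Disposition: attachment header, and the "preview" flag deciding between the two behaviors.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExportHeaders {
    // Build response headers for an export (hypothetical helper).
    // preview == true  -> let the browser display the content inline.
    // preview == false -> ask the browser to download it as a file.
    static Map<String, String> headersFor(String fileName, boolean preview) {
        Map<String, String> headers = new LinkedHashMap<>();
        // The content type only describes the data; it no longer doubles
        // as a trick to force a download. The charset parameter is an assumption.
        headers.put("Content-Type", "text/plain; charset=UTF-8");
        if (!preview) {
            // Assumes a simple ASCII file name without quotes.
            headers.put("Content-Disposition",
                    "attachment; filename=\"" + fileName + "\"");
        }
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(headersFor("export.csv", false));
        System.out.println(headersFor("export.csv", true));
    }
}
```

File names containing quotes or non-ASCII characters would need additional escaping, which this sketch leaves out.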
@tfmorris Using OpenRefine 3.4 beta 1. import-test-sample.txt import (no more problems than before)
① Export TSV (garbled characters)
② Export CSV (garbled characters)
③ Export HTML (garbled characters)
④ Export Excel (NOT garbled characters)
⑤ Export Excel 2007+ (NOT garbled characters)
⑥ Export ODF SpreadSheet (NOT garbled characters)
⑦ Export SQL (NOT garbled characters)
⑧ Export SpreadSheet (NOT garbled characters)
@Yuutakasan This has not been fixed in 3.4 beta; that version was released before this fix. For a version that we expect not to have the issue, try this one:
@wetneb Thanks. I'll re-test.
@wetneb @tfmorris Using openrefine-win-3.4-beta-148-gf88c0e3. export openrefine 3.4-beta-148-gf88c0e3.zip export spreadsheet
Excellent. Thank you very much for testing, @Yuutakasan.
OpenRefine 2.7 rc2
After importing a UTF-8 file and exporting it as a UTF-8 file, the exported text contains garbled characters.
Displayed characters

有限会社なべ茶屋あさ𡌛
Exported garbled characters

有限会社なべ茶屋あさ����
Other characters that are garbled on export
𣘺𣳾