Composited unicode characters not displayed correctly. #1226

vpenades · 2019-06-03T09:06:21Z

Read and complete the full issue template

Do you want to request a feature or report a bug?

Bug
Feature

Version of ClosedXML

0.94.2

What is the current behavior?

When I fill a cell with value: C0 👍 0/2 ✖0 ⚠️0

It is displayed like this: C0 _xD83D__xDC4D_ 0/2 ✖0 ⚠️0 in LibreOffice 6.2.3.2

What is the expected behavior or new feature?

Unicode emoji should display correctly.

Now, I don't know if this is a ClosedXML bug, or a LibreOffice bug. It seems 👍 is a composited unicode character... dunno which are the rules to display this in excel and libreoffice.

Reproducibility

This is an important section. Read it carefully. Failure to do so will cause a 'RTFM' comment.

Without a code sample, it is unlikely that your issue will get attention. Don't be lazy. Do the effort and assist the developers to reproduce your problem. Code samples should be minimal complete and verifiable. Sample spreadsheets should be attached whenever applicable. Remove sensitive information.

Code to reproduce problem:

public void Main()
{
    var book = new XLWorkbook();
    var sheet = book .AddWorksheet("sheet");
    var cell = sheet.Cell(1,1);
    cell.Value ="C0  👍 0/2 ✖0 ⚠️0";
    book.SaveAs("unicode.xlsx");
}

I attached a sample spreadsheet. (You can drag files on to this issue)

The text was updated successfully, but these errors were encountered:

igitur · 2019-06-03T09:11:26Z

So you just deleted the whole reproducibility section of the issue template?

vpenades · 2019-06-03T09:15:23Z

So you just deleted the whole reproducibility section of the issue template?

Well... Since this is purely a string content issue, I thought it was not neccesary... I can add one if it's a requirement.

yoyos · 2023-01-11T15:27:13Z

Have exact problem with 0.100.2

using (var workbook = new XLWorkbook())
            {
                var worksheet = workbook.Worksheets.Add("Sheet1");
                var currentRow = 1;
                worksheet.Cell(1, 1).Value = "Head1";
                worksheet.Cell(2, 1).Value = "🤔";
                using (var stream = new MemoryStream())
                {
                    workbook.SaveAs("test.xlsx");
                }
            }

If I open the xlsx in Excel 2013 and File -> Save As -> Browse
Select Tools -> Web Options
In Encoding Tab, select UTF-8

The new excel file is valid.

jahav · 2023-01-15T02:35:16Z

I was rather confused what was the problem, everything was displayed fine for me, both samples generated files that displayed correctly in LibreOffice 7.4 and 6.2.3.2 and in Excel 2016 and Excel 2021. I don't have 2013 version.

Relevant spec ECMA-376:

22.9.2.19 ST_Xstring (Escaped String)
For all characters which cannot be represented in XML as defined by the XML 1.0 specification, the characters
are escaped using the Unicode numerical character representation escape character format xHHHH, where H
represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted
in an XML 1.0 document, so it must be escaped as x0008. end example]

XML 1.0 Spec chapter 2.2

Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
All XML processors must accept the UTF-8 and UTF-16 encodings of Unicode

So after reading all that dry specifications, the conclusion is that at least the smiley is a compound character with a code U+1F914 and that is within the range allowed by XML 1.0 spec and should be just written to the file and not escaped.

yoyos · 2023-01-15T07:09:49Z

Thanks for your help but I didn't change namespaces when using closdxml and inserted the smiley without encoding it.

Actually I even extracted the XLSX archive and opened the specific sheet in notepad++ and saw that the smiley was encoded without asking as : _xHhh_xHhh_ and the XML had the utf8bom encoding.

I generated the exact same excel with openxml lib and it worked first try. Character was not encoded and displayed correctly in notepad++

Should I open a new issue and link a file + code sample ?

jahav · 2023-01-15T12:07:09Z

@yoyos I don't think it is a separate issue.

By "should be just written to the file and not escaped" I meant that ClosedXML should just write it out as UTF-8 char without doing anything. It calls XmlEncoder.XmlEncode here but because these long characters are in two chars (char is 2 byte), each half of the char is encoded separately and it shouldn't. Instead, it should recognize that it got a composite character in valid range and just pass it along to encoder.

EDIT:Basically the result should be this (this is file created by excel for UTF-8 encoding):

NOT this:

igitur · 2023-01-15T13:32:03Z

@jahav Did you see the referenced PR and linked issue? As far as I could tell, there isn't an easy way to determine whether the emoji chars need to be escaped, except if you reimplement the logic from that library.

jahav · 2023-01-15T13:44:05Z

@igitur Sorry, I missed those. Thanks.

jahav · 2023-01-15T16:52:17Z

@igitur Ok, I looked at at I am still confused (don't like PR because it is either another 1MB dependency or a another fork).

Confusion often happens when multiple encoding (String in UTF-16, output in UTF-8 and custom OpenXML requirement for XML encoding).

I think that it can be solved simply by using classic Encoding from BCL and Char.IsHighSurrogate, char.ConvertToUtf32(highSurrogate,lowSurrogate). The only thing that is throwing spanner into works is the requirement to encode non-XML characters (no idea what to do about codepoints from astral plane).

Basically only add char.IsSurrogate to XmlEncoder.EncodeString:

var len = encodeStr.Length;
for (var i = 0; i < len; ++i)
{
	var currentChar = encodeStr[i];
	if (XmlConvert.IsXmlChar(currentChar))
	{
		sb.Append(currentChar);
	}
	else if (char.IsSurrogate(currentChar) && i < len - 1 && XmlConvert.IsXmlSurrogatePair(encodeStr[i + 1], currentChar))
	{
		sb.Append(currentChar);
		sb.Append(encodeStr[++i]);
	}
	else
	{
		sb.Append(XmlConvert.EncodeName(currentChar.ToString()));
	}
}

That is all. The problem stem from passing a a half of a surrogate pair to IsXmlChar that expects a codepoint (to be fair, it doesn't accept int, so it is limited and that is why there is an extra method).

Encoding in the stream (the OpenXmlWriter uses Encoding.UTF8 by default) takes chars from string, detects that there is a surrogate pair of chars in the input and outputs correct byte sequence for a codepoint detected from surrogate. Am I missing something?

jahav · 2023-01-17T23:15:44Z

@yoyos I made a PR #1978 that should fix the issue, but since I couldn't reproduce the original issue, can you please try the dev version of nuget package from the PR if it solves the problem?

Here is a guide: https://github.com/ClosedXML/ClosedXML/wiki/Development-Builds

Once you confirm, I will merge it the PR #1978 .

yoyos · 2023-01-18T10:32:01Z

Thanks @jahav

With ClosedXML v0.100.3:

<x:si><x:t>Test: C0 _xD83D__xDC4D_ 0/2 ✖0 ⚠️0 _xD83E__xDD14_</x:t></x:si>

With your PR:

<x:si><x:t>Test: C0 👍 0/2 ✖0 ⚠️0 🤔</x:t></x:si>

igitur added the RTFM label Jun 3, 2019

igitur mentioned this issue Jun 12, 2019

Add support for Emoji via Unicode.net library #1231

Closed

igitur added this to the v0.96 milestone Jun 13, 2019

igitur mentioned this issue Feb 21, 2021

Split into 2 packages neosmart/unicode.net#16

Open

igitur removed this from the v0.96 milestone Jun 29, 2022

jahav mentioned this issue Jan 17, 2023

Output unicode codepoints from astral planes #1978

Merged

jahav closed this as completed in #1978 Jan 18, 2023

jahav added this to the v0.101 milestone Jan 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Composited unicode characters not displayed correctly. #1226

Composited unicode characters not displayed correctly. #1226

vpenades commented Jun 3, 2019 •

edited

igitur commented Jun 3, 2019

vpenades commented Jun 3, 2019

yoyos commented Jan 11, 2023

jahav commented Jan 15, 2023

yoyos commented Jan 15, 2023 •

edited

jahav commented Jan 15, 2023 •

edited

igitur commented Jan 15, 2023 •

edited

jahav commented Jan 15, 2023

jahav commented Jan 15, 2023 •

edited

jahav commented Jan 17, 2023

yoyos commented Jan 18, 2023

Composited unicode characters not displayed correctly. #1226

Composited unicode characters not displayed correctly. #1226

Comments

vpenades commented Jun 3, 2019 • edited

Read and complete the full issue template

Reproducibility

igitur commented Jun 3, 2019

vpenades commented Jun 3, 2019

yoyos commented Jan 11, 2023

jahav commented Jan 15, 2023

yoyos commented Jan 15, 2023 • edited

jahav commented Jan 15, 2023 • edited

igitur commented Jan 15, 2023 • edited

jahav commented Jan 15, 2023

jahav commented Jan 15, 2023 • edited

jahav commented Jan 17, 2023

yoyos commented Jan 18, 2023

vpenades commented Jun 3, 2019 •

edited

yoyos commented Jan 15, 2023 •

edited

jahav commented Jan 15, 2023 •

edited

igitur commented Jan 15, 2023 •

edited

jahav commented Jan 15, 2023 •

edited