Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Composited unicode characters not displayed correctly. #1226

Closed
1 of 3 tasks
vpenades opened this issue Jun 3, 2019 · 11 comments Β· Fixed by #1978
Closed
1 of 3 tasks

Composited unicode characters not displayed correctly. #1226

vpenades opened this issue Jun 3, 2019 · 11 comments Β· Fixed by #1978
Labels
Milestone

Comments

@vpenades
Copy link

vpenades commented Jun 3, 2019

Read and complete the full issue template

Do you want to request a feature or report a bug?

  • Bug
  • Feature

Version of ClosedXML

0.94.2

What is the current behavior?

When I fill a cell with value: C0 πŸ‘ 0/2 βœ–0 ⚠️0

It is displayed like this: C0 _xD83D__xDC4D_ 0/2 βœ–0 ⚠️0 in LibreOffice 6.2.3.2

What is the expected behavior or new feature?

Unicode emoji should display correctly.

Now, I don't know if this is a ClosedXML bug, or a LibreOffice bug. It seems πŸ‘ is a composited unicode character... dunno which are the rules to display this in excel and libreoffice.

Reproducibility

This is an important section. Read it carefully. Failure to do so will cause a 'RTFM' comment.

Without a code sample, it is unlikely that your issue will get attention. Don't be lazy. Do the effort and assist the developers to reproduce your problem. Code samples should be minimal complete and verifiable. Sample spreadsheets should be attached whenever applicable. Remove sensitive information.

Code to reproduce problem:

public void Main()
{
    var book = new XLWorkbook();
    var sheet = book .AddWorksheet("sheet");
    var cell = sheet.Cell(1,1);
    cell.Value ="C0  πŸ‘ 0/2 βœ–0 ⚠️0";
    book.SaveAs("unicode.xlsx");
}
  • I attached a sample spreadsheet. (You can drag files on to this issue)
@igitur
Copy link
Member

igitur commented Jun 3, 2019

So you just deleted the whole reproducibility section of the issue template?

@igitur igitur added the RTFM label Jun 3, 2019
@vpenades
Copy link
Author

vpenades commented Jun 3, 2019

So you just deleted the whole reproducibility section of the issue template?

Well... Since this is purely a string content issue, I thought it was not neccesary... I can add one if it's a requirement.

@yoyos
Copy link

yoyos commented Jan 11, 2023

Have exact problem with 0.100.2

using (var workbook = new XLWorkbook())
            {
                var worksheet = workbook.Worksheets.Add("Sheet1");
                var currentRow = 1;
                worksheet.Cell(1, 1).Value = "Head1";
                worksheet.Cell(2, 1).Value = "πŸ€”";
                using (var stream = new MemoryStream())
                {
                    workbook.SaveAs("test.xlsx");
                }
            }

If I open the xlsx in Excel 2013 and File -> Save As -> Browse
Select Tools -> Web Options
In Encoding Tab, select UTF-8

The new excel file is valid.

@jahav
Copy link
Member

jahav commented Jan 15, 2023

I was rather confused what was the problem, everything was displayed fine for me, both samples generated files that displayed correctly in LibreOffice 7.4 and 6.2.3.2 and in Excel 2016 and Excel 2021. I don't have 2013 version.
image

Relevant spec ECMA-376:

22.9.2.19 ST_Xstring (Escaped String)
For all characters which cannot be represented in XML as defined by the XML 1.0 specification, the characters
are escaped using the Unicode numerical character representation escape character format xHHHH, where H
represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted
in an XML 1.0 document, so it must be escaped as x0008. end example]

XML 1.0 Spec chapter 2.2

Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
All XML processors must accept the UTF-8 and UTF-16 encodings of Unicode

So after reading all that dry specifications, the conclusion is that at least the smiley is a compound character with a code U+1F914 and that is within the range allowed by XML 1.0 spec and should be just written to the file and not escaped.

@yoyos
Copy link

yoyos commented Jan 15, 2023

Thanks for your help but I didn't change namespaces when using closdxml and inserted the smiley without encoding it.

Actually I even extracted the XLSX archive and opened the specific sheet in notepad++ and saw that the smiley was encoded without asking as : _xHhh_xHhh_ and the XML had the utf8bom encoding.

I generated the exact same excel with openxml lib and it worked first try. Character was not encoded and displayed correctly in notepad++

Should I open a new issue and link a file + code sample ?

@jahav
Copy link
Member

jahav commented Jan 15, 2023

@yoyos I don't think it is a separate issue.

By "should be just written to the file and not escaped" I meant that ClosedXML should just write it out as UTF-8 char without doing anything. It calls XmlEncoder.XmlEncode here but because these long characters are in two chars (char is 2 byte), each half of the char is encoded separately and it shouldn't. Instead, it should recognize that it got a composite character in valid range and just pass it along to encoder.

EDIT:Basically the result should be this (this is file created by excel for UTF-8 encoding):
image
NOT this:
image

@igitur
Copy link
Member

igitur commented Jan 15, 2023

@jahav Did you see the referenced PR and linked issue? As far as I could tell, there isn't an easy way to determine whether the emoji chars need to be escaped, except if you reimplement the logic from that library.

@jahav
Copy link
Member

jahav commented Jan 15, 2023

@igitur Sorry, I missed those. Thanks.

@jahav
Copy link
Member

jahav commented Jan 15, 2023

@igitur Ok, I looked at at I am still confused (don't like PR because it is either another 1MB dependency or a another fork).

Confusion often happens when multiple encoding (String in UTF-16, output in UTF-8 and custom OpenXML requirement for XML encoding).

I think that it can be solved simply by using classic Encoding from BCL and Char.IsHighSurrogate, char.ConvertToUtf32(highSurrogate,lowSurrogate). The only thing that is throwing spanner into works is the requirement to encode non-XML characters (no idea what to do about codepoints from astral plane).

Basically only add char.IsSurrogate to XmlEncoder.EncodeString:

var len = encodeStr.Length;
for (var i = 0; i < len; ++i)
{
	var currentChar = encodeStr[i];
	if (XmlConvert.IsXmlChar(currentChar))
	{
		sb.Append(currentChar);
	}
	else if (char.IsSurrogate(currentChar) && i < len - 1 && XmlConvert.IsXmlSurrogatePair(encodeStr[i + 1], currentChar))
	{
		sb.Append(currentChar);
		sb.Append(encodeStr[++i]);
	}
	else
	{
		sb.Append(XmlConvert.EncodeName(currentChar.ToString()));
	}
}

That is all. The problem stem from passing a a half of a surrogate pair to IsXmlChar that expects a codepoint (to be fair, it doesn't accept int, so it is limited and that is why there is an extra method).

Encoding in the stream (the OpenXmlWriter uses Encoding.UTF8 by default) takes chars from string, detects that there is a surrogate pair of chars in the input and outputs correct byte sequence for a codepoint detected from surrogate. Am I missing something?

image

@jahav
Copy link
Member

jahav commented Jan 17, 2023

@yoyos I made a PR #1978 that should fix the issue, but since I couldn't reproduce the original issue, can you please try the dev version of nuget package from the PR if it solves the problem?

Here is a guide: https://github.com/ClosedXML/ClosedXML/wiki/Development-Builds

Once you confirm, I will merge it the PR #1978 .

@yoyos
Copy link

yoyos commented Jan 18, 2023

Thanks @jahav

With ClosedXML v0.100.3:

<x:si><x:t>Test: C0 _xD83D__xDC4D_ 0/2 βœ–0 ⚠️0 _xD83E__xDD14_</x:t></x:si>

With your PR:

<x:si><x:t>Test: C0 πŸ‘ 0/2 βœ–0 ⚠️0 πŸ€”</x:t></x:si>

@jahav jahav added this to the v0.101 milestone Jan 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants