Fix attempting to combine Hangul Jamo 0x11a7 #317
`0x11a7` is not a valid Hangul trailing consonant (T) jamo despite being equal to `T_BASE`. This is because, per the Unicode spec:

> TCount is set to one more than the number of trailing consonants relevant to the decomposition algorithm: (11C2₁₆ - 11A8₁₆ + 1) + 1

So the first valid Hangul T jamo is `0x11a8`. Also see https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59434 for where the spec describes the usage of `0x11a8`, not `0x11a7`, during composition.
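The arithmetic in that quote can be checked in a few lines. This is a minimal sketch, not utf8proc code: the constant names loosely follow the spec's TBase and TCount, and `valid_trailing_range` is a hypothetical helper introduced only for illustration.

```python
# Constants as given in the Unicode core spec's Hangul section.
T_BASE = 0x11A7    # one below the first trailing consonant
T_FIRST = 0x11A8   # HANGUL JONGSEONG KIYEOK
T_LAST = 0x11C2    # HANGUL JONGSEONG HIEUH

# "one more than the number of trailing consonants": (11C2 - 11A8 + 1) + 1
T_COUNT = (T_LAST - T_FIRST + 1) + 1


def valid_trailing_range():
    """Hypothetical helper: code points that may act as T during composition."""
    # Index 0 (T_BASE itself) stands for "no trailing consonant", so skip it.
    return range(T_BASE + 1, T_BASE + T_COUNT)


print(T_COUNT)                          # 28
print(hex(valid_trailing_range()[0]))   # 0x11a8
print(hex(valid_trailing_range()[-1]))  # 0x11c2
```

The extra `+ 1` is what makes `T_BASE` itself fall inside `[T_BASE, T_BASE + T_COUNT)` without being a real trailing consonant.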
Thanks! Can you add a test? I checked that Python 3 leaves the sequence unchanged under NFC:
>>> import unicodedata
>>> s = "\uc8e0\u11a7"
>>> s
'죠ᆧ'
>>> list(map(ord, s))
[51424, 4519]
>>> list(map(ord, unicodedata.normalize('NFC', s)))
[51424, 4519]
Wow, this is a weird spec. Subtracting TBase gives a 1-based index (which we were incorrectly treating as 0-based, apparently), but you still use |
e.g. add something like this to the test file:
static void issue317(void) /* #317 */
{
    utf8proc_uint8_t input1[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa7, 0x00}; /* "\uc8e0\u11a7" */
    utf8proc_uint8_t nfc1[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa7, 0x00}; /* "\uc8e0\u11a7" */
    utf8proc_uint8_t input2[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa8, 0x00}; /* "\uc8e0\u11a8" */
    utf8proc_uint8_t nfc2[] = {0xec, 0xa3, 0xa1, 0x00}; /* "\uc8e1" */
    utf8proc_uint8_t *nfc1_out, *nfc2_out;
    /* U+11A7 equals TBase and must not be combined */
    nfc1_out = utf8proc_NFC(input1);
    printf("NFC \"%s\" -> \"%s\" vs. \"%s\"\n", (char*)input1, (char*)nfc1_out, (char*)nfc1);
    check(strlen((char*) nfc1_out) == 6, "incorrect nfc length");
    check(!memcmp(nfc1, nfc1_out, 7), "incorrect nfc data");
    /* U+11A8 is the first valid trailing consonant and must be combined */
    nfc2_out = utf8proc_NFC(input2);
    printf("NFC \"%s\" -> \"%s\" vs. \"%s\"\n", (char*)input2, (char*)nfc2_out, (char*)nfc2);
    check(strlen((char*) nfc2_out) == 3, "incorrect nfc length");
    check(!memcmp(nfc2, nfc2_out, 4), "incorrect nfc data");
    free(nfc2_out); free(nfc1_out);
}
(probably this file could use some refactoring to reduce code repetition, but that can be a separate PR. Update: will be fixed by #318.)
Given #318 you can now do:
static void issue317(void) /* #317 */
{
    utf8proc_uint8_t input[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa7, 0x00}; /* "\uc8e0\u11a7" */
    utf8proc_uint8_t combined[] = {0xec, 0xa3, 0xa1, 0x00}; /* "\uc8e1" */
    utf8proc_int32_t codepoint;
    /* inputs that should *not* be combined */
    check_compare("NFC", input, input, utf8proc_NFC(input), 1);
    utf8proc_encode_char(0x11c3, input+3);
    check_compare("NFC", input, input, utf8proc_NFC(input), 1);
    /* inputs that *should* be combined (TCOUNT-1 chars starting at TBASE+1) */
    for (codepoint = 0x11a8; codepoint < 0x11c3; ++codepoint) {
        utf8proc_encode_char(codepoint, input+3);
        utf8proc_encode_char(0xc8e0 + (codepoint - 0x11a7), combined);
        check_compare("NFC", input, combined, utf8proc_NFC(input), 1);
    }
}
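The expected values in this test can be cross-checked against an independent implementation. The sketch below runs the same cases through Python's `unicodedata`, which the earlier comment reports as already handling `0x11a7` correctly. `nfc_codepoints` is a hypothetical helper, not part of either library.

```python
import unicodedata

LV = 0xC8E0       # the LV syllable used in the test above
T_BASE = 0x11A7


def nfc_codepoints(*codepoints):
    """Hypothetical helper: NFC-normalize a sequence of code points."""
    text = "".join(map(chr, codepoints))
    return [ord(ch) for ch in unicodedata.normalize("NFC", text)]


# Inputs that should not be combined: TBase itself and the first
# code point past the trailing-consonant range.
for t in (0x11A7, 0x11C3):
    assert nfc_codepoints(LV, t) == [LV, t]

# Inputs that should be combined: 0x11a8 through 0x11c2.
for t in range(0x11A8, 0x11C3):
    assert nfc_codepoints(LV, t) == [LV + (t - T_BASE)]
```

If every assertion passes, the C test and Python agree on all 29 cases.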
utf8proc.c
  utf8proc_int32_t hangul_tindex;
  hangul_tindex = current_char - UTF8PROC_HANGUL_TBASE;
- if (hangul_tindex >= 0 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {
+ if (hangul_tindex >= 1 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {
Suggested change:
- if (hangul_tindex >= 1 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {
+ if (hangul_tindex > 0 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {
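To make the effect of the bound change concrete, here is a sketch of the LV + T composition step under the old and new lower bounds. It is a simplified Python model, not the utf8proc source: `compose_lv_t` and its `min_tindex` parameter are hypothetical, and the model assumes the first code point is already a precomposed LV syllable.

```python
T_BASE = 0x11A7
T_COUNT = 28


def compose_lv_t(lv, t, min_tindex):
    """Hypothetical model of the LV + T step; returns the output code points.

    min_tindex=0 models the old check (tindex >= 0),
    min_tindex=1 models the fixed check (tindex > 0).
    Assumes `lv` is already a precomposed LV syllable.
    """
    tindex = t - T_BASE
    if min_tindex <= tindex < T_COUNT:
        return [lv + tindex]   # T is absorbed into the syllable
    return [lv, t]             # T stays a separate code point


print([hex(c) for c in compose_lv_t(0xC8E0, 0x11A7, min_tindex=0)])  # ['0xc8e0']
print([hex(c) for c in compose_lv_t(0xC8E0, 0x11A7, min_tindex=1)])  # ['0xc8e0', '0x11a7']
print([hex(c) for c in compose_lv_t(0xC8E0, 0x11A8, min_tindex=1)])  # ['0xc8e1']
```

With the old bound, index 0 adds nothing to the syllable but still consumes the input code point, so `0x11a7` silently disappears from the output.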
Is there a similar problem with In particular, it looks like there isn't the same off-by-1 problem with S characters:
and their sample code checks |
Yeah, agreed. Definitely one of the stranger parts of the spec. I suspect it might have to do with making the math much cleaner for the algorithmic decomposition of Hangul syllables (i.e. perhaps it makes division by To answer your question about I'll go ahead and apply your changes and write a test. Thanks for the review! |
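One way to see the role of the reserved index is to write out the arithmetic decomposition from the spec. This is a sketch only: `decompose_hangul` is a hypothetical name, the constants are the spec's SBase, LBase, VBase, TBase, VCount and TCount, and the input is assumed to be a precomposed syllable in the range `0xac00` to `0xd7a3`.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant


def decompose_hangul(s):
    """Hypothetical helper: split a precomposed syllable into L, V and optional T."""
    s_index = s - S_BASE
    l_index, rest = divmod(s_index, N_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    parts = [L_BASE + l_index, V_BASE + v_index]
    if t_index != 0:           # remainder 0 means "no trailing consonant"
        parts.append(T_BASE + t_index)
    return parts


print([hex(c) for c in decompose_hangul(0xC8E0)])  # ['0x110c', '0x116d']
print([hex(c) for c in decompose_hangul(0xC8E1)])  # ['0x110c', '0x116d', '0x11a8']
```

Because each LV pair owns a block of `T_COUNT` consecutive syllables, with the LV form in slot 0, a single `divmod` by `T_COUNT` recovers both the vowel and the trailing consonant, and slot 0 never maps back to `0x11a7`.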
For an example of where this causes a bug in utf8proc, try composing the Unicode string `<0xc8e0, 0x11a7>`. The output is `<0xc8e0>`, when it should be `<0xc8e0, 0x11a7>`. Reproduction: