Skip to content

Conversation

@dzfrias
Copy link
Contributor

@dzfrias dzfrias commented Nov 22, 2025

0x11a7 is not a valid Hangul T syllable despite being equal to T_BASE. This is because, per the Unicode spec:

TCount is set to one more than the number of trailing consonants
relevant to the decomposition algorithm: (11C216 - 11A816 + 1) + 1

So the first valid Hangul T syllable is 0x11a8. Also see https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59434 for where the spec describes the usage of 0x11a8, not 0x11a7, during composition.

For an example of where this causes a bug in utf8proc, try composing the Unicode string <0xc8e0, 0x11a7>. The output is <0xc8e0>, when it should be <0xc8e0, 0x11a7>. Reproduction:

#include <stdio.h>
#include <utf8proc.h>

int main() {
  utf8proc_uint8_t input[] = {236, 163, 160, 225, 134, 167};

  utf8proc_uint8_t *normalized = NULL;
  utf8proc_map(input, 6, &normalized, UTF8PROC_STABLE | UTF8PROC_COMPOSE);

  printf("bytes: ");
  for (utf8proc_uint8_t *c = normalized; *c != 0; c++) {
    printf("%u ", *c);
  }
  printf("\n");

  free(normalized);

  return 0;
}

0x11a7 is not a valid Hangul T syllable despite being equal to T_BASE.
This is because, per the Unicode spec:

  TCount is set to one more than the number of trailing consonants
  relevant to the decomposition algorithm: (11C216 - 11A816 + 1) + 1

So the first valid Hangul T syllable is 0x11a8. Also see
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59434
for where the spec describes the usage of 0x11a8, not 0x11a7, during
composition.
@stevengj
Copy link
Member

Thanks! Can you add a test? I checked that Python3 unicodedata does the right thing here:

>>> import unicodedata
>>> s = "\uc8e0\u11a7"
>>> s
'죠ᆧ'
>>> list(map(ord, s))
[51424, 4519]
>>> list(map(ord, unicodedata.normalize('NFC', s)))
[51424, 4519]

@stevengj
Copy link
Member

TBase is set to one less than the beginning of the range of trailing consonants, which starts at U+11A8. TCount is set to one more than the number of trailing consonants relevant to the decomposition algorithm

Wow, this is a weird spec. Subtracting TBase gives a 1-based index (which we were incorrectly treating as 0-based, apparently), but you still use < TCount because it is 1 more than the length. I wonder why they did that?

@stevengj
Copy link
Member

stevengj commented Nov 22, 2025

e.g. add something to test/misc.c, like:

static void issue317(void) /* #317 */
{
    utf8proc_uint8_t input1[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa7, 0x00}; /* "\uc8e0\u11a7" */
    utf8proc_uint8_t nfc1[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa7, 0x00}; /* "\uc8e0\u11a7" */
    utf8proc_uint8_t input2[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa8, 0x00}; /* "\uc8e0\u11a8" */
    utf8proc_uint8_t nfc2[] = {0xec, 0xa3, 0xa, 0x00}; /* "\uc8e1" */
    utf8proc_uint8_t *nfc1_out, *nfc2_out;
    nfc1_out = utf8proc_NFC(input1);
    printf("NFC \"%s\" -> \"%s\" vs. \"%s\"\n", (char*)input1, (char*)nfc1_out, (char*)nfc1);
    check(strlen((char*) nfc1_out) == 6, "incorrect nfc length");
    check(!memcmp(nfc1, nfc1_out, 7), "incorrect nfc data");
    printf("NFC \"%s\" -> \"%s\" vs. \"%s\"\n", (char*)input2, (char*)nfc2_out, (char*)nfc2);
    check(strlen((char*) nfc2_out) == 3, "incorrect nfc length");
    check(!memcmp(nfc2, nfc2_out, 4), "incorrect nfc data");
    free(nfc2_out); free(nfc1_out);
}

(probably this file could use some refactoring to reduce code repetition, but that can be a separate PR — update, will be fixed by #318).

@stevengj
Copy link
Member

Given #318 you can now do:

static void issue317(void) /* #317 */
{
    utf8proc_uint8_t input[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa7, 0x00}; /* "\uc8e0\u11a7" */
    utf8proc_uint8_t combined[] = {0xec, 0xa3, 0xa, 0x00}; /* "\uc8e1" */
    utf8proc_int32_t codepoint;

    /* inputs that should *not* be combined* */
    check_compare("NFC", input, input, utf8proc_NFC(input), 1);
    utf8proc_encode_char(0x11c3, input+3);
    check_compare("NFC", input, input, utf8proc_NFC(input), 1);

    /* inputs that *should* be combined (TCOUNT-1 chars starting at TBASE+1) */
    for (codepoint = 0x11a8; codepoint < 0x11c3; ++codepoint) {
        utf8proc_encode_char(codepoint, input+3);
        utf8proc_encode_char(0xc8e0 + (codepoint - 0x11a7), combined);
        check_compare("NFC", input, combined, utf8proc_NFC(input), 1);
    }
}

utf8proc.c Outdated
utf8proc_int32_t hangul_tindex;
hangul_tindex = current_char - UTF8PROC_HANGUL_TBASE;
if (hangul_tindex >= 0 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {
if (hangul_tindex >= 1 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (hangul_tindex >= 1 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {
if (hangul_tindex > 0 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {

@stevengj
Copy link
Member

stevengj commented Nov 22, 2025

Is there a similar problem with UTF8PROC_HANGUL_SCOUNT? Fortunately, it looks okay at first glance.

In particular, it looks like there isn't the same the off-by-1 problem with S characters:

SCount is the total number of precomposed Hangul syllables.

and their sample code checks (SIndex < 0 || SIndex >= SCount) with SIndex = s - SBase to see if s is not in Hangul S, so it looks like that corresponds to what we are doing.

@dzfrias
Copy link
Contributor Author

dzfrias commented Nov 22, 2025

Wow, this is a weird spec. Subtracting TBase gives a 1-based index (which we were incorrectly treating as 0-based, apparently), but you still use < TCount because it is 1 more than the length. I wonder why they did that?

Yeah, agreed. Definitely one of the stranger parts of the spec. I suspect it might have to do with making the math much cleaner for the algorithmic decomposition of Hangul syllables (i.e. perhaps it makes division by TCount and modulo with TCount meaningful).

To answer your question about SCount, I think we should be okay because there doesn't seem to be a good reason for why there would be an off-by-one, and there also doesn't seem to be any mention of it in the spec after doing a second look-over.

I'll go ahead and apply your changes and write a test. Thanks for the review!

@stevengj stevengj merged commit 0260ba5 into JuliaStrings:master Nov 22, 2025
12 checks passed
@dzfrias dzfrias deleted the hangul_t_fix branch November 23, 2025 00:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants