Fix attempting to combine Hangul Jamo 0x11a7 #317

dzfrias · 2025-11-22T04:40:47Z

0x11a7 is not a valid Hangul T syllable despite being equal to T_BASE. This is because, per the Unicode spec:

TCount is set to one more than the number of trailing consonants
relevant to the decomposition algorithm: (11C216 - 11A816 + 1) + 1

So the first valid Hangul T syllable is 0x11a8. Also see https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59434 for where the spec describes the usage of 0x11a8, not 0x11a7, during composition.

For an example of where this causes a bug in utf8proc, try composing the Unicode string <0xc8e0, 0x11a7>. The output is <0xc8e0>, when it should be <0xc8e0, 0x11a7>. Reproduction:

#include <stdio.h>
#include <utf8proc.h>

int main() {
  utf8proc_uint8_t input[] = {236, 163, 160, 225, 134, 167};

  utf8proc_uint8_t *normalized = NULL;
  utf8proc_map(input, 6, &normalized, UTF8PROC_STABLE | UTF8PROC_COMPOSE);

  printf("bytes: ");
  for (utf8proc_uint8_t *c = normalized; *c != 0; c++) {
    printf("%u ", *c);
  }
  printf("\n");

  free(normalized);

  return 0;
}

0x11a7 is not a valid Hangul T syllable despite being equal to T_BASE. This is because, per the Unicode spec: TCount is set to one more than the number of trailing consonants relevant to the decomposition algorithm: (11C216 - 11A816 + 1) + 1 So the first valid Hangul T syllable is 0x11a8. Also see https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59434 for where the spec describes the usage of 0x11a8, not 0x11a7, during composition.

stevengj · 2025-11-22T13:21:17Z

Thanks! Can you add a test? I checked that Python3 unicodedata does the right thing here:

>>> import unicodedata
>>> s = "\uc8e0\u11a7"
>>> s
'죠ᆧ'
>>> list(map(ord, s))
[51424, 4519]
>>> list(map(ord, unicodedata.normalize('NFC', s)))
[51424, 4519]

stevengj · 2025-11-22T13:29:22Z

TBase is set to one less than the beginning of the range of trailing consonants, which starts at U+11A8. TCount is set to one more than the number of trailing consonants relevant to the decomposition algorithm

Wow, this is a weird spec. Subtracting TBase gives a 1-based index (which we were incorrectly treating as 0-based, apparently), but you still use < TCount because it is 1 more than the length. I wonder why they did that?

stevengj · 2025-11-22T13:40:07Z

e.g. add something to test/misc.c, like:

static void issue317(void) /* #317 */
{
    utf8proc_uint8_t input1[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa7, 0x00}; /* "\uc8e0\u11a7" */
    utf8proc_uint8_t nfc1[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa7, 0x00}; /* "\uc8e0\u11a7" */
    utf8proc_uint8_t input2[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa8, 0x00}; /* "\uc8e0\u11a8" */
    utf8proc_uint8_t nfc2[] = {0xec, 0xa3, 0xa, 0x00}; /* "\uc8e1" */
    utf8proc_uint8_t *nfc1_out, *nfc2_out;
    nfc1_out = utf8proc_NFC(input1);
    printf("NFC \"%s\" -> \"%s\" vs. \"%s\"\n", (char*)input1, (char*)nfc1_out, (char*)nfc1);
    check(strlen((char*) nfc1_out) == 6, "incorrect nfc length");
    check(!memcmp(nfc1, nfc1_out, 7), "incorrect nfc data");
    printf("NFC \"%s\" -> \"%s\" vs. \"%s\"\n", (char*)input2, (char*)nfc2_out, (char*)nfc2);
    check(strlen((char*) nfc2_out) == 3, "incorrect nfc length");
    check(!memcmp(nfc2, nfc2_out, 4), "incorrect nfc data");
    free(nfc2_out); free(nfc1_out);
}

(probably this file could use some refactoring to reduce code repetition, but that can be a separate PR — update, will be fixed by #318).

stevengj · 2025-11-22T14:58:24Z

Given #318 you can now do:

static void issue317(void) /* #317 */
{
    utf8proc_uint8_t input[] = {0xec, 0xa3, 0xa0, 0xe1, 0x86, 0xa7, 0x00}; /* "\uc8e0\u11a7" */
    utf8proc_uint8_t combined[] = {0xec, 0xa3, 0xa, 0x00}; /* "\uc8e1" */
    utf8proc_int32_t codepoint;

    /* inputs that should *not* be combined* */
    check_compare("NFC", input, input, utf8proc_NFC(input), 1);
    utf8proc_encode_char(0x11c3, input+3);
    check_compare("NFC", input, input, utf8proc_NFC(input), 1);

    /* inputs that *should* be combined (TCOUNT-1 chars starting at TBASE+1) */
    for (codepoint = 0x11a8; codepoint < 0x11c3; ++codepoint) {
        utf8proc_encode_char(codepoint, input+3);
        utf8proc_encode_char(0xc8e0 + (codepoint - 0x11a7), combined);
        check_compare("NFC", input, combined, utf8proc_NFC(input), 1);
    }
}

stevengj · 2025-11-22T15:00:08Z

utf8proc.c

          utf8proc_int32_t hangul_tindex;
          hangul_tindex = current_char - UTF8PROC_HANGUL_TBASE;
-          if (hangul_tindex >= 0 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {
+          if (hangul_tindex >= 1 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {


Suggested change

if (hangul_tindex >= 1 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {

if (hangul_tindex > 0 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {

stevengj · 2025-11-22T15:00:36Z

Is there a similar problem with UTF8PROC_HANGUL_SCOUNT? Fortunately, it looks okay at first glance.

In particular, it looks like there isn't the same the off-by-1 problem with S characters:

SCount is the total number of precomposed Hangul syllables.

and their sample code checks (SIndex < 0 || SIndex >= SCount) with SIndex = s - SBase to see if s is not in Hangul S, so it looks like that corresponds to what we are doing.

dzfrias · 2025-11-22T15:53:10Z

Wow, this is a weird spec. Subtracting TBase gives a 1-based index (which we were incorrectly treating as 0-based, apparently), but you still use < TCount because it is 1 more than the length. I wonder why they did that?

Yeah, agreed. Definitely one of the stranger parts of the spec. I suspect it might have to do with making the math much cleaner for the algorithmic decomposition of Hangul syllables (i.e. perhaps it makes division by TCount and modulo with TCount meaningful).

To answer your question about SCount, I think we should be okay because there doesn't seem to be a good reason for why there would be an off-by-one, and there also doesn't seem to be any mention of it in the spec after doing a second look-over.

I'll go ahead and apply your changes and write a test. Thanks for the review!

…oc_reencode (JuliaStrings#312)

stevengj reviewed Nov 22, 2025

View reviewed changes

stevengj and others added 4 commits November 22, 2025 08:04

document that utf8proc_map simply wraps utf8proc_decompose and utf8pr…

14f6b93

…oc_reencode (JuliaStrings#312)

test code refactoring (JuliaStrings#318)

164b9f6

Write regression test for JuliaStrings#317

57c7a2d

Merge branch 'master' into hangul_t_fix

519e22c

stevengj approved these changes Nov 22, 2025

View reviewed changes

stevengj merged commit 0260ba5 into JuliaStrings:master Nov 22, 2025
12 checks passed

This was referenced Nov 22, 2025

version bump to 2.11.2 #319

Merged

update utf8proc to 2.11.2 JuliaLang/julia#60207

Open

dzfrias deleted the hangul_t_fix branch November 23, 2025 00:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix attempting to combine Hangul Jamo 0x11a7 #317

Fix attempting to combine Hangul Jamo 0x11a7 #317

Uh oh!

dzfrias commented Nov 22, 2025

Uh oh!

stevengj commented Nov 22, 2025

Uh oh!

stevengj commented Nov 22, 2025

Uh oh!

stevengj commented Nov 22, 2025 •

edited

Loading

Uh oh!

stevengj commented Nov 22, 2025

Uh oh!

stevengj Nov 22, 2025

Uh oh!

stevengj commented Nov 22, 2025 •

edited

Loading

Uh oh!

dzfrias commented Nov 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	if (hangul_tindex >= 1 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {
	if (hangul_tindex > 0 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {

Fix attempting to combine Hangul Jamo 0x11a7 #317

Fix attempting to combine Hangul Jamo 0x11a7 #317

Uh oh!

Conversation

dzfrias commented Nov 22, 2025

Uh oh!

stevengj commented Nov 22, 2025

Uh oh!

stevengj commented Nov 22, 2025

Uh oh!

stevengj commented Nov 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stevengj commented Nov 22, 2025

Uh oh!

stevengj Nov 22, 2025

Choose a reason for hiding this comment

Uh oh!

stevengj commented Nov 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dzfrias commented Nov 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stevengj commented Nov 22, 2025 •

edited

Loading

stevengj commented Nov 22, 2025 •

edited

Loading