Unicode 15.1 support #253

stevengj · 2023-10-18T18:02:40Z

Support for Unicode 15.1, which means updating the tables but also adding a new rule to the grapheme-break algorithm to account for the new Indic_Conjunct_Break property. Fixes #252

Currently a work-in-progress. To do:

* get the grapheme test passing

Update: should be ready now.
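For readers unfamiliar with the new rule, here is a minimal Python sketch of what GB9c (the rule driven by Indic_Conjunct_Break) says: do not break before a consonant when it is preceded by a consonant, then any run of Extend/Linker characters containing at least one Linker. It is not the utf8proc implementation. The `INCB` table and the helper names `incb` and `gb9c_no_break` are hypothetical, and the four property values are hand-copied on the assumption that they match the Unicode 15.1 data files. The other grapheme-break rules (GB9 and so on) are deliberately left out.

```python
# Tiny hand-written Indic_Conjunct_Break (InCB) table. Assumption: these four
# values match Unicode 15.1; a real implementation uses the generated tables.
INCB = {
    0x0915: "Consonant",  # DEVANAGARI LETTER KA
    0x0937: "Consonant",  # DEVANAGARI LETTER SSA
    0x094D: "Linker",     # DEVANAGARI SIGN VIRAMA
    0x200D: "Extend",     # ZERO WIDTH JOINER
}


def incb(cp):
    """Return the InCB value of a code point ("None" if not in the toy table)."""
    return INCB.get(cp, "None")


def gb9c_no_break(codepoints, i):
    """Return True if GB9c forbids a break between codepoints[i-1] and codepoints[i].

    GB9c: Consonant [Extend Linker]* Linker [Extend Linker]* x Consonant
    """
    # The character after the candidate break must be a consonant.
    if not 0 < i < len(codepoints) or incb(codepoints[i]) != "Consonant":
        return False
    # Walk backwards over the run of Extend/Linker characters,
    # remembering whether at least one Linker was seen.
    seen_linker = False
    j = i - 1
    while j >= 0 and incb(codepoints[j]) in ("Extend", "Linker"):
        seen_linker = seen_linker or incb(codepoints[j]) == "Linker"
        j -= 1
    # The run must contain a Linker and be preceded by a consonant.
    return seen_linker and j >= 0 and incb(codepoints[j]) == "Consonant"


kssa = [0x0915, 0x094D, 0x0937]          # KA + VIRAMA + SSA (a conjunct)
print(gb9c_no_break(kssa, 2))            # no break before SSA
print(gb9c_no_break([0x0915, 0x0937], 1))  # no linker, so GB9c does not apply
```

A stateful implementation would instead carry the "consonant seen" and "linker seen" flags forward one code point at a time, which avoids the backwards scan.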

Similar to #47392, support [Unicode 15.1](https://www.unicode.org/versions/Unicode15.1.0/) by bumping utf8proc to 2.9.0 (JuliaStrings/utf8proc#253). This allows us to use [118 exciting new emoji characters](https://blog.emojipedia.org/whats-new-in-unicode-15-1-and-emoji-15-1/) as identifiers, including "edible mushroom" `"\U1f344\u200d\U1f7eb"` (but still no superscript "q"). Interestingly, they also updated the [Unicode recommendations on programming-language identifiers (UAX#31)](https://www.unicode.org/reports/tr31/tr31-39.html#Mathematical_Compatibility_Notation_Profile) to finally "bless" identifiers beginning with `∂` and `∇` and/or ending with numeric sub/superscripts. They still don't recommend nearly the range of identifiers accepted by Julia, however.
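To make the mushroom example concrete, the following Python sketch (an illustration only, not Julia code) spells out the three code points of that sequence: an existing emoji, a zero-width joiner, and a second existing character. Python needs the eight-digit `\U0001F344` escape form where the Julia literal above uses `\U1f344`.

```python
# The emoji ZWJ sequence quoted above, written with Python escapes.
seq = "\U0001F344\u200D\U0001F7EB"

# One entry per code point: base emoji, ZERO WIDTH JOINER, second character.
codepoints = [f"U+{ord(c):04X}" for c in seq]
print(codepoints)                # ['U+1F344', 'U+200D', 'U+1F7EB']

# UTF-8 length: 4 + 3 + 4 bytes.
print(len(seq.encode("utf-8")))  # 11
```

The sequence introduces no new code points; it is the combination that is newly recommended as a single emoji.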

PallHaraldsson · 2023-10-24T16:06:21Z

Do you think this, i.e. the JuliaLang PR JuliaLang/julia#51799, should be backported to 1.10, in case it becomes the LTS, or at least by the time it does?

I reviewed the code here, which is rather small (and that PR is trivial), except for the Ruby generator, which I don't think I need to scrutinize. It seems safe, and preferable, to backport, though I did not look at utf8proc_data.c since it's quite large (and generated?).

I don't think your change is a breaking change, but I'm not sure.

The repertoire addition consists almost entirely of urgently needed CJK ideographs, synchronized with planned additions to the Chinese national standard, GB 18030. The remaining additions to the repertoire extend the set of ideographic description characters, to better enable description of unusual CJK ideographs.

This matters because, according to Wikipedia:

The updated standard GB18030-2022 is incompatible, and it had an enforcement date of 1 August 2023. It has been implemented in ICU 73.2 and in Java 21, and backported to older Java 8, 11, 17 (LTS releases) and 20.0.2.

Also:

Major updates were made to UAX #9, Unicode Bidirectional Algorithm, UAX #31, Unicode Identifiers and Syntax, and UTS #39, Unicode Security Mechanisms, to coordinate with the publication of an important new Unicode Technical Standard: UTS #55, Unicode Source Code Handling.

* Segmentation rule changes, most notably:
  a. Support was added to line breaking (UAX #14, Unicode Line Breaking Algorithm) for orthographic syllables in a number of South and Southeast Asian writing systems.
  b. Grapheme cluster breaking (UAX #29, Unicode Text Segmentation) has adopted the aksara cluster behavior for six scripts. That cluster breaking behavior had previously been widely available via CLDR and ICU.
  c. These changes involved significant character property updates.

What I find likely to be breaking in GB18030-2022, and thus I think in Unicode 15.1 (but not at the level of utf8proc?):

CJK/Unihan Changes

Seven old provisional properties have been removed.

Six new provisional properties have been added.

stevengj · 2023-10-24T17:02:06Z

It's not a breaking change, I think (mainly just adding new characters, and tweaking some grapheme-break rules), but it's a new feature and thus probably not eligible for backport.

stevengj added 7 commits October 18, 2023 14:00

* Unicode 15.1 support (8c7149b)
* always update state (4085574)
* fix GB9c logic (8c7c5a3)
* print indic_conjunct_break in printproperty (58c7c83)
* fix grapheme test (fe29f03)
* update utf8proc_decompose_char docs (bae57f3)
* more GB9c tests (527ae3d)

stevengj merged commit 46a442b into master Oct 20, 2023
12 checks passed

stevengj deleted the unicode15.1 branch October 20, 2023 20:25

stevengj mentioned this pull request Oct 20, 2023

support Unicode 15.1 via utf8proc 2.9.0 JuliaLang/julia#51799

Merged

stevengj mentioned this pull request Oct 26, 2023

update utf8proc to 2.9.0 JuliaPackaging/Yggdrasil#7578

Merged

DonKult mentioned this pull request Oct 29, 2023

Support Unicode 15.1 new GB9c break rule ycm-core/ycmd#1718

Closed
