Prepare for Unicode 9.0

The major code changes needed to support Unicode 9.0 are due to changes
in the boundary (break) rules, for things like \b{lb} and \b{wb}.
regen/mk_invlists.pl creates a two-dimensional array for each of these
properties.  To see if a given point in the target string is a break or
not, regexec.c looks up the entry in the property's table whose row
corresponds to the code point before the potential break, and whose
column corresponds to the one after.  In most cases this entry alone
determines the answer, but in some cases extra context is required.
The array entry indicates this, and there has to be specially crafted
code in regexec.c to handle each such possibility.  When a new release
comes along, mk_invlists.pl has to be changed to handle any new or
changed rules, and the custom code in regexec.c has to be updated to
match.
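
As a rough illustration of the table scheme described above, here is a
minimal sketch in C.  It is not the code in regexec.c: the toy_* names,
the four break classes, and the table contents are assumptions invented
for the example, whereas the real enums and tables are generated by
regen/mk_invlists.pl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, drastically reduced set of break classes. */
typedef enum { TOY_Other, TOY_CR, TOY_LF, TOY_RI, TOY_CLASS_COUNT } toy_class;

/* A table entry is either a definite answer or a request for context. */
typedef enum { TOY_NO_BREAK, TOY_BREAK, TOY_NEEDS_CONTEXT } toy_action;

/* Row = class of the code point before the candidate position,
 * column = class of the code point after it. */
static const toy_action toy_table[TOY_CLASS_COUNT][TOY_CLASS_COUNT] = {
    /*             Other      CR         LF            RI                */
    /* Other */ { TOY_BREAK, TOY_BREAK, TOY_BREAK,    TOY_BREAK         },
    /* CR    */ { TOY_BREAK, TOY_BREAK, TOY_NO_BREAK, TOY_BREAK         },
    /* LF    */ { TOY_BREAK, TOY_BREAK, TOY_BREAK,    TOY_BREAK         },
    /* RI    */ { TOY_BREAK, TOY_BREAK, TOY_BREAK,    TOY_NEEDS_CONTEXT },
};

/* The common case is a single array access.  When the result is
 * TOY_NEEDS_CONTEXT the caller must run hand-written code that looks
 * further back or ahead in the string. */
static toy_action toy_lookup(toy_class before, toy_class after)
{
    return toy_table[before][after];
}
```

Only the entries marked TOY_NEEDS_CONTEXT would require custom code, which
is why a rule change in a new Unicode release can touch both the table
generator and the hand-written part.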

Unfortunately this is not a mature area of the Standard, and changes are
fairly common in new releases.  In part, this is because new types of
code points come along, which need new rules.  Sometimes it is because
Unicode realized the previous rules didn't work as well as they could.  An
example of the latter is that Unicode now realizes that Regional
Indicator (RI) characters come in pairs, and that one should be able to
break between each pair, but not within a pair.  Previous versions
treated any run of them as unbreakable.  (Regional Indicators are a
fairly recent type that was added to the Standard in 6.0, and things are
still getting shaken out.)
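
The pairing rule for Regional Indicators can be sketched as follows.
This is an assumed, simplified model (one boolean per character and a
hypothetical function name), not the implementation in regexec.c: a
break between two RIs is allowed only when an even number of RIs
precedes the candidate position.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* is_ri[i] is true when character i is a Regional Indicator.
 * pos is the index of the character just after the candidate break;
 * the caller must ensure 0 < pos < len.
 * Returns true if a break is allowed at pos under the pairing rule.
 * Positions that are not between two RIs are outside the scope of this
 * sketch and are simply reported as breakable. */
static bool toy_ri_break_allowed(const bool *is_ri, size_t len, size_t pos)
{
    size_t ri_before = 0;
    size_t i = pos;

    if (pos == 0 || pos >= len || !is_ri[pos - 1] || !is_ri[pos]) {
        return true;
    }

    /* Count the unbroken run of RIs that ends just before pos. */
    while (i > 0 && is_ri[i - 1]) {
        ri_before++;
        i--;
    }

    /* An even count means the preceding RIs are already fully paired. */
    return (ri_before % 2) == 0;
}
```

With four RIs in a row, the only allowed break inside the run falls
between the second and third, which is the behavior that requires
looking backwards through the string rather than at just two adjacent
characters.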

The other main changes to these rules also involve a fairly new type of
character, emojis.  We can expect further changes to these in the next
Unicode releases.

\b{gcb} now, for the first time, depends on context (in rarely
encountered cases, like RIs), so the function had to be changed from a
simple table look-up to be more like the functions handling the other
break properties.

Some years ago I revamped mktables in part to try to make it require as
few manual interventions as possible when upgrading to a new version of
Unicode.  For example, a new data file in a release requires telling
mktables about it, but as long as it follows the format of existing
recent files, nothing else need be done to get whatever properties it
describes to be included.

Some of the changes to mktables involved guessing, from existing limited
data, what the underlying paradigm for that data was.  The problem with
that is there may not have been a paradigm, just something Unicode did
ad hoc, which can change at will; or I didn't understand their unstated
thinking, and guessed wrong.

Besides the boundary rule changes, the only change that the existing
mktables couldn't cope with was the addition of the Tangut script, whose
character names include the code point, as CJK names like CJK UNIFIED
IDEOGRAPH-3400 always have.  The paradigm for this wasn't clear while
CJK was the only script with this characteristic, and so I hard-coded
it into mktables.  The way Tangut is structured may show that there is a
paradigm emerging (but we only have two examples, and there may not be a
paradigm at all), and so I have guessed one, and changed mktables to
assume this guessed paradigm.  If other scripts like this come along,
and I have guessed correctly, mktables will cope with these
automatically without manual intervention.
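
The guessed naming paradigm can be illustrated with a short C sketch.
It loosely mirrors what the lib/charnames.t change in this commit does
in Perl; the function name and buffer sizes are assumptions made for
the example, and the range labels are of the kind found in
UnicodeData.txt (for instance "<Tangut Ideograph, First>").

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Build an algorithmic character name from a range label and a code
 * point.  Returns a pointer to a static buffer, so this sketch is not
 * thread-safe and each call overwrites the previous result. */
static const char *toy_name_for(const char *range_label, unsigned long cp)
{
    static char name[128];
    char base[96];
    size_t n = 0;

    if (strncmp(range_label, "<CJK", 4) == 0) {
        /* The CJK ranges all map to one fixed base name. */
        strcpy(base, "CJK UNIFIED IDEOGRAPH");
    }
    else {
        /* Otherwise: drop the leading '<', cut at the first comma,
         * and upper-case what remains. */
        const char *p = range_label;
        if (*p == '<') {
            p++;
        }
        while (*p != '\0' && *p != ',' && n + 1 < sizeof base) {
            base[n++] = (char) toupper((unsigned char) *p);
            p++;
        }
        base[n] = '\0';
    }

    /* The code point is appended in hex, padded to at least 4 digits. */
    snprintf(name, sizeof name, "%s-%04lX", base, cp);
    return name;
}
```

If a future script follows the same convention, a function of this shape
would produce its names with no new special case, which is the hope
expressed above.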
khwilliamson committed Jun 22, 2016
1 parent 6295dc1 commit b0e2440
Showing 11 changed files with 738 additions and 352 deletions.
571 changes: 318 additions & 253 deletions charclass_invlists.h

Large diffs are not rendered by default.

9 changes: 8 additions & 1 deletion embed.fnc
@@ -2358,7 +2358,14 @@ Es |void |to_utf8_substr |NN regexp * prog
Es |bool |to_byte_substr |NN regexp * prog
ERsn |I32 |reg_check_named_buff_matched |NN const regexp *rex \
|NN const regnode *scan
-EinR |bool |isGCB |const GCB_enum before|const GCB_enum after
+EsR |bool |isGCB |const GCB_enum before \
+|const GCB_enum after \
+|NN const U8 * const strbeg \
+|NN const U8 * const curpos \
+|const bool utf8_target
+EsR |GCB_enum|backup_one_GCB|NN const U8 * const strbeg \
+|NN U8 ** curpos \
+|const bool utf8_target
EsR |bool |isLB |LB_enum before \
|LB_enum after \
|NN const U8 * const strbeg \
3 changes: 2 additions & 1 deletion embed.h
@@ -1111,13 +1111,14 @@
#define advance_one_LB(a,b,c) S_advance_one_LB(aTHX_ a,b,c)
#define advance_one_SB(a,b,c) S_advance_one_SB(aTHX_ a,b,c)
#define advance_one_WB(a,b,c,d) S_advance_one_WB(aTHX_ a,b,c,d)
+#define backup_one_GCB(a,b,c) S_backup_one_GCB(aTHX_ a,b,c)
#define backup_one_LB(a,b,c) S_backup_one_LB(aTHX_ a,b,c)
#define backup_one_SB(a,b,c) S_backup_one_SB(aTHX_ a,b,c)
#define backup_one_WB(a,b,c,d) S_backup_one_WB(aTHX_ a,b,c,d)
#define find_byclass(a,b,c,d,e) S_find_byclass(aTHX_ a,b,c,d,e)
#define isFOO_lc(a,b) S_isFOO_lc(aTHX_ a,b)
#define isFOO_utf8_lc(a,b) S_isFOO_utf8_lc(aTHX_ a,b)
-#define isGCB S_isGCB
+#define isGCB(a,b,c,d,e) S_isGCB(aTHX_ a,b,c,d,e)
#define isLB(a,b,c,d,e,f) S_isLB(aTHX_ a,b,c,d,e,f)
#define isSB(a,b,c,d,e,f) S_isSB(aTHX_ a,b,c,d,e,f)
#define isWB(a,b,c,d,e,f,g) S_isWB(aTHX_ a,b,c,d,e,f,g)
13 changes: 12 additions & 1 deletion lib/Unicode/UCD.pm
@@ -5,7 +5,7 @@ use warnings;
no warnings 'surrogate'; # surrogates can be inputs to this
use charnames ();

-our $VERSION = '0.65';
+our $VERSION = '0.66';

require Exporter;

@@ -98,6 +98,9 @@ Unicode::UCD - Unicode character database
use Unicode::UCD 'search_invlist';
my $index = search_invlist(\@invlist, $code_point);
+# The following function should be used only internally in
+# implementations of the Unicode Normalization Algorithm, and there
+# are better choices than it.
use Unicode::UCD 'compexcl';
my $compexcl = compexcl($codepoint);
@@ -1200,6 +1203,12 @@ sub bidi_types {

=head2 B<compexcl()>
+WARNING: Unicode discourages the use of this function or any of the
+alternative mechanisms listed in this section (the documention of
+C<compexcl()>), except internally in implementations of the Unicode
+Normalization Algorithm. You should be using L<Unicode::Normalize> directly
+instead of these. Using these will likely lead to half-baked results.
use Unicode::UCD 'compexcl';
my $compexcl = compexcl(0x09dc);
@@ -3044,6 +3053,8 @@ L<Unicode::Normalize::NFD()|Unicode::Normalize>.
Note that the mapping is the one that is specified in the Unicode data files,
and to get the final decomposition, it may need to be applied recursively.
+Unicode in fact discourages use of this property except internally in
+implementations of the Unicode Normalization Algorithm.
The fourth (index [3]) element (C<$default>) in the list returned for this
format is 0.
26 changes: 20 additions & 6 deletions lib/charnames.t
@@ -1009,7 +1009,7 @@ is("\N{U+1D0C5}", "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}", 'V
die "Can't open ../../lib/unicore/UnicodeData.txt: $!";
while (<$fh>) {
chomp;
-my ($code, $name, undef, undef, undef, undef, undef, undef, undef, undef, $u1name) = split ";";
+my ($code, $name, $category, undef, undef, undef, undef, undef, undef, undef, $u1name) = split ";";
my $decimal = utf8::unicode_to_native(hex $code);
$code = sprintf("%04X", $decimal) unless $::IS_ASCII;

@@ -1042,12 +1042,26 @@ is("\N{U+1D0C5}", "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}", 'V
/^(.*?);/;
my $end_decimal = hex $1;

-# Only the CJK (and the Hangul which are instead dealt with below)
-# ones have names, and they all have the code point as part of the
-# name, which we can construct
-if ($name =~ /^<CJK/) {
+# Only the ones whose category is a letter currently have names,
+# and of those the Hangul Syllables are dealt with below
+if ( $category eq 'Lo' && $name !~ /^Hangul/i) {
+
+# The CJK ones all get translated to a particular form; we
+# just capitalize any others in the hopes that Unicode will
+# use the correct term in any future ones it might add.
+if ($name =~ /^<CJK/) {
+$name = "CJK UNIFIED IDEOGRAPH";
+}
+else {
+$name =~ s/<//;
+$name =~ s/,.*//;
+$name = uc($name);
+}
+
+# They all have the code point as part of the name, which we
+# can construct
for my $i ($decimal .. $end_decimal) {
-$names[$i] = sprintf "CJK UNIFIED IDEOGRAPH-%04X", $i;
+$names[$i] = sprintf "$name-%04X", $i;
my $block = $i >> $block_size_bits;
$algorithmic_names_count[$block]++;
}
29 changes: 24 additions & 5 deletions lib/unicore/mktables
@@ -45,7 +45,7 @@ sub NON_ASCII_PLATFORM { ord("A") != 65 }
# expected, a warning will be generated. If an older version is being
# compiled, any bounds tests that fail in the generated test file (-maketest
# option) will be marked as TODO.
-my $version_of_mk_invlist_bounds = v8.0.0;
+my $version_of_mk_invlist_bounds = v9.0.0;

##########################################################################
#
@@ -11741,7 +11741,16 @@ END
. $CMD_DELIM
. $fields[$CHARNAME];
}
-elsif ($fields[$CHARNAME] =~ /^CJK/) {
+elsif ($fields[$CATEGORY] eq 'Lo') { # Is a letter
+
+# All the CJK ranges like this have the name given as a
+# special case in the next code line. And for the others, we
+# hope that Unicode continues to use the correct name in
+# future releases, so we don't have to make further special
+# cases.
+my $name = ($fields[$CHARNAME] =~ /^CJK/)
+? 'CJK UNIFIED IDEOGRAPH'
+: uc $fields[$CHARNAME];

# The name for these contains the code point itself, and all
# are defined to have the same base name, regardless of what
@@ -11753,7 +11762,7 @@ END
. '='
. $CP_IN_NAME
. $CMD_DELIM
-. 'CJK UNIFIED IDEOGRAPH';
+. $name;

}
elsif ($fields[$CATEGORY] eq 'Co'
@@ -19193,7 +19202,8 @@ my @input_file_objects = (
. 'incorporated into the Unicode data base',
),
Input_file->new('StandardizedVariants.html', v3.2.0,
-Skip => 'Provides a visual display of the standard '
+Skip => 'Obsoleted as of Unicode 9.0, but previously '
+. 'provided a visual display of the standard '
. 'variant sequences derived from '
. 'F<StandardizedVariants.txt>.',
# I don't know why the html came earlier than the
@@ -19407,6 +19417,12 @@ my @input_file_objects = (
Property => 'Indic_Positional_Category',
Has_Missings_Defaults => $NOT_IGNORED,
),
+Input_file->new('TangutSources.txt', v9.0.0,
+Skip => 'Specifies source mappings for Tangut ideographs'
+. ' and components. This data file also includes'
+. ' informative radical-stroke values that are used'
+. ' internally by Unicode',
+),
);

# End of all the preliminaries.
@@ -19871,7 +19887,10 @@ if (defined &locales_enabled) {
}

# Eval'd so can run on versions earlier than the property is available in
-my $WB_Extend_or_Format_re = eval 'qr/[\p{WB=Extend}\p{WB=Format}]/';
+my $WB_Extend_or_Format_re = eval 'qr/[\p{WB=Extend}\p{WB=Format}\p{WB=ZWJ}]/';
+if (! defined $WB_Extend_or_Format_re) {
+$WB_Extend_or_Format_re = eval 'qr/[\p{WB=Extend}\p{WB=Format}]/';
+}

sub _test_break($$) {
# Test various break property matches. The 2nd parameter gives the
2 changes: 1 addition & 1 deletion pod/perlrebackslash.pod
@@ -623,7 +623,7 @@ If the final space character in the span is a horizontal white space, it
is broken out so that it attaches instead to the combining character.
To be precise, if a span of white space that ends in a horizontal space
has the character immediately following it have either of the Word
-Boundary property values "Extend" or "Format", the boundary between the
+Boundary property values "Extend", "Format" or "ZWJ", the boundary between the
final horizontal space character and the rest of the span matches
C<\b{wb}>. In all other cases the boundary between two white space
characters matches C<\B{wb}>.)
9 changes: 8 additions & 1 deletion proto.h
@@ -5219,6 +5219,11 @@ STATIC WB_enum S_advance_one_WB(pTHX_ U8 ** curpos, const U8 * const strend, con
#define PERL_ARGS_ASSERT_ADVANCE_ONE_WB \
assert(curpos); assert(strend)

+STATIC GCB_enum S_backup_one_GCB(pTHX_ const U8 * const strbeg, U8 ** curpos, const bool utf8_target)
+__attribute__warn_unused_result__;
+#define PERL_ARGS_ASSERT_BACKUP_ONE_GCB \
+assert(strbeg); assert(curpos)
+
STATIC LB_enum S_backup_one_LB(pTHX_ const U8 * const strbeg, U8 ** curpos, const bool utf8_target)
__attribute__warn_unused_result__;
#define PERL_ARGS_ASSERT_BACKUP_ONE_LB \
@@ -5247,8 +5252,10 @@ STATIC bool S_isFOO_utf8_lc(pTHX_ const U8 classnum, const U8* character)
#define PERL_ARGS_ASSERT_ISFOO_UTF8_LC \
assert(character)

-PERL_STATIC_INLINE bool S_isGCB(const GCB_enum before, const GCB_enum after)
+STATIC bool S_isGCB(pTHX_ const GCB_enum before, const GCB_enum after, const U8 * const strbeg, const U8 * const curpos, const bool utf8_target)
__attribute__warn_unused_result__;
+#define PERL_ARGS_ASSERT_ISGCB \
+assert(strbeg); assert(curpos)

STATIC bool S_isLB(pTHX_ LB_enum before, LB_enum after, const U8 * const strbeg, const U8 * const curpos, const U8 * const strend, const bool utf8_target)
__attribute__warn_unused_result__;
4 changes: 2 additions & 2 deletions regcharclass.h
@@ -1852,7 +1852,7 @@
#endif /* H_REGCHARCLASS */

/* Generated from:
-* 66726fe32be96a422e8c9b45bc9daf61e068d988c99ff41112972ef721365521 lib/Unicode/UCD.pm
+* de6076d81bc4e85f179377ded4c68f3b257c8f7990227d4302eca442fda558f8 lib/Unicode/UCD.pm
* ae98bec7e4f0564758eed81eca5015481ba32581f8a735a825b71b3bba714450 lib/unicore/ArabicShaping.txt
* 1687fe5994eb7e5c0dab8503fc2a1b3b479d91af9d3b8055941c9bd791f7d0b5 lib/unicore/BidiBrackets.txt
* 350d1302116194b0b21def287434b55c5088098fbc726e879f7420a391965643 lib/unicore/BidiMirroring.txt
@@ -1895,7 +1895,7 @@
* 1a0687fb9c6c4567e853913549df0944fe40821279a3e9cdaa6ab8679bc286fd lib/unicore/extracted/DLineBreak.txt
* 40bcfed3ca727c19e1331f6c33806231d5f7eeeabd2e6a9e06a3740c85d0c250 lib/unicore/extracted/DNumType.txt
* a18d502bad39d527ac5586d7bc93e29f565859e3bcc24ada627eff606d6f5fed lib/unicore/extracted/DNumValues.txt
-* 45321b549a605b65ead1e83cdb90fdd9c5a6c8731a537197f335bab251b4e778 lib/unicore/mktables
+* 4fbcc500e9215a31d39fa3fba793a4c893285e7d19912fc86fa6518120ecc4e1 lib/unicore/mktables
* 462c9aaa608fb2014cd9649af1c5c009485c60b9c8b15b89401fdc10cf6161c6 lib/unicore/version
* 913d2f93f3cb6cdf1664db888bf840bc4eb074eef824e082fceda24a9445e60c regen/charset_translations.pl
* d9c04ac46bdd81bb3e26519f2b8eb6242cb12337205add3f7cf092b0c58dccc4 regen/regcharclass.pl