Add Unicode property wildcards
khwilliamson committed Mar 12, 2019
1 parent 2cd613e commit 1532347
Showing 6 changed files with 345 additions and 3 deletions.
16 changes: 16 additions & 0 deletions pod/perldelta.pod
@@ -27,6 +27,22 @@ here, but most should go in the L</Performance Enhancements> section.

[ List each enhancement as a =head2 entry ]

=head2 Wildcards in Unicode property value specifications are now
partially supported

You can now do something like this in a regular expression pattern

qr! \p{nv= /(?x) \A [0-5] \z / }!

which matches all Unicode code points whose numeric value is an integer
from 0 to 5 inclusive.

This marks another step in implementing the regular expression features
the Unicode Consortium suggests.

Most properties are supported, with the remainder planned for 5.32.
Details are in L<perlunicode/Wildcards in Property Values>.

=head2 Unicode 12.0 is supported

For details, see L<https://www.unicode.org/versions/Unicode12.0.0/>.
20 changes: 20 additions & 0 deletions pod/perldiag.pod
@@ -4244,6 +4244,12 @@ earlier as an attempt to close an unopened filehandle.
not recognized. Say C<kill -l> in your shell to see the valid signal
names on your system.

=item No Unicode property value wildcard matches:

(W regexp) You specified a wildcard for a Unicode property value, but
there is no property value in the current Unicode release that matches
it. Check your spelling.

=item Not a CODE reference

(F) Perl was trying to evaluate a reference to a code value (that is, a
@@ -6172,6 +6178,12 @@ linkhood if the last stat that wrote to the stat buffer already went
past the symlink to get to the real file. Use an actual filename
instead.

=item The Unicode property wildcards feature is experimental

(S experimental::uniprop_wildcards) This feature is experimental
and its behavior may change in any future release of perl. See
L<perlunicode/Wildcards in Property Values>.

=item The 'unique' attribute may only be applied to 'our' variables

(F) This attribute was never supported on C<my> or C<sub> declarations.
@@ -6711,6 +6723,14 @@ This is not really a "severe" error, but it is supposed to be
raised by default even if warnings are not enabled, and currently
the only way to do that in Perl is to mark it as serious.

=item Unicode property wildcard not terminated

(F) A Unicode property wildcard looks like a delimited regular
expression pattern (all within the braces of the enclosing C<\p{...}>).
The closing delimiter to match the opening one was not found. If the
opening one is escaped by preceding it with a backslash, the closing one
must also be so escaped.

=item Unicode surrogate U+%X is illegal in UTF-8

(S surrogate) You had a UTF-16 surrogate in a context where they are
2 changes: 1 addition & 1 deletion pod/perlre.pod
@@ -1019,7 +1019,7 @@ See L<perlrecharclass/POSIX Character Classes> for details.

=item [3]

-See L<perlrecharclass/Backslash sequences> for details.
+See L<perlunicode/Unicode Character Properties> for details

=item [4]

3 changes: 3 additions & 0 deletions pod/perlrecharclass.pod
Expand Up @@ -405,6 +405,9 @@ non-Unicode code points. This could be somewhat surprising:
Even though these two matches might be thought of as complements, until
v5.20 they were so only on Unicode code points.

Starting in perl v5.30, wildcards are allowed in Unicode property
values. See L<perlunicode/Wildcards in Property Values>.

=head4 Examples

"a" =~ /\w/ # Match, "a" is a 'word' character.
146 changes: 144 additions & 2 deletions pod/perlunicode.pod
@@ -921,6 +921,145 @@ L<perlrecharclass/POSIX Character Classes>.

=back

=head2 Wildcards in Property Values

Starting in Perl 5.30, it is possible to do something like this:

qr!\p{numeric_value=/\A[0-5]\z/}!

or, by abbreviating and adding C</x>,

qr! \p{nv= /(?x) \A [0-5] \z / }!

This matches all code points whose numeric value is one of 0, 1, 2, 3,
4, or 5. This particular example could instead have been written as

qr! \A [ \p{nv=0}\p{nv=1}\p{nv=2}\p{nv=3}\p{nv=4}\p{nv=5} ] \z !xx

in earlier perls, so in this case this feature just makes things easier
and shorter to write. If we hadn't included the C<\A> and C<\z>, these
would have matched things like C<1E<sol>2> because that contains a 1 (as
well as a 2). As written, it matches things like subscripts that have
these numeric values. If we only wanted the decimal digits with those
numeric values, we could say,

qr! (?[ \d & \p{nv=/[0-5]/} ]) !x

The C<\d> gets rid of needing to anchor the pattern, since it forces the
result to match only decimal digits, whose numeric values are 0 through
9, and the C<[0-5]> further restricts it.
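
The difference between the anchored wildcard and the C<\d>-restricted
form can be checked with a short script. This is only a sketch: it
assumes perl 5.30 or later, and the variable names and the sample
characters are chosen here for illustration.

```perl
use strict;
use warnings;
# Assumes perl 5.30 or later; earlier perls do not know this category.
no warnings 'experimental::uniprop_wildcards';
no warnings 'experimental::regex_sets';

# Any code point whose numeric value is exactly one of 0 through 5.
my $any_0_to_5 = qr!\p{nv=/\A[0-5]\z/}!;

# Only decimal digits (in any script) whose numeric value is 0 through 5.
my $digit_0_to_5 = qr! (?[ \d & \p{nv=/[0-5]/} ]) !x;

# Illustrative sample characters, picked for this sketch.
my @samples = (
    [ 'DIGIT THREE',              "\x{33}"  ],
    [ 'DIGIT NINE',               "\x{39}"  ],
    [ 'SUPERSCRIPT TWO',          "\x{B2}"  ],
    [ 'VULGAR FRACTION ONE HALF', "\x{BD}"  ],
    [ 'ARABIC-INDIC DIGIT FOUR',  "\x{664}" ],
);

# Report, for each sample, whether each pattern matches it.
for my $sample (@samples) {
    my ($name, $char) = @$sample;
    printf "%-26s any=%s digit=%s\n", $name,
        ($char =~ $any_0_to_5   ? 'yes' : 'no'),
        ($char =~ $digit_0_to_5 ? 'yes' : 'no');
}
```

SUPERSCRIPT TWO has numeric value 2 but is not a decimal digit, so it
should be accepted by the first pattern only.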

The text in the above examples enclosed between the C<"E<sol>">
characters can be just about any regular expression. It is independent
of the main pattern, so doesn't share any capturing groups, I<etc>. The
delimiters for it must be ASCII punctuation, but it may NOT be
delimited by C<"{">, nor C<"}"> nor contain a literal C<"}">, as that
delimits the end of the enclosing C<\p{}>. Like any pattern, certain
other delimiters are terminated by their mirror images. These are
C<"(">, C<"[">, and C<"E<lt>">. If the delimiter is any of C<"-">,
C<"_">, C<"+">, or C<"\">, or is the same delimiter as is used for the
enclosing pattern, it must be preceded by a backslash escape, both
fore and aft.

Beware of using C<"$"> to match the end of the string. It can too
easily be interpreted as being a punctuation variable, like C<$/>.

No modifiers may follow the final delimiter. Instead, use
L<perlre/(?adlupimnsx-imnsx)> and/or
L<perlre/(?adluimnsx-imnsx:pattern)> to specify modifiers.

This feature is not available when the left-hand side is prefixed by
C<Is_>, nor for any form that is marked as "Discouraged" in
L<perluniprops/Discouraged>.

Perl wraps your pattern with C<(?iaa: ... )>. This is because nothing
outside ASCII can match the Unicode property values available in this
release, and they should match caselessly. If your pattern has a syntax
error, this wrapping will be shown in the error message, even though you
didn't specify it yourself. This could be confusing if you don't know
about this.

This experimental feature has been added to begin to implement
L<https://www.unicode.org/reports/tr18/#Wildcard_Properties>. Using it
will raise a (default-on) warning in the
C<experimental::uniprop_wildcards> category. We reserve the right to
change its operation as we gain experience.

Your subpattern can be just about anything, but for it to have some
utility, it should match when called with either or both of
a) the full name of the property value with underscores (and/or spaces
in the Block property) and some things uppercase; or b) the property
value in all lowercase with spaces and underscores squeezed out. For
example,

qr!\p{Blk=/Old I.*/}!
qr!\p{Blk=/oldi.*/}!

would match the same things.

A warning is issued if none of the legal values for a property are
matched by your pattern. It's likely that a future release will raise a
warning if your pattern ends up causing every possible code point to
match.

Another example that shows that within C<\p{...}>, C</x> isn't needed to
have spaces:

qr!\p{scx= /Hebrew|Greek/ }!

To be safe, we should have anchored the above example, to prevent
matches for something like C<Hebrew_Braille>, but there aren't
any script names like that.
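
An anchored version of that example might look like the following
sketch. It assumes perl 5.30 or later; the variable name and the three
test code points are illustrative choices, not part of the feature.

```perl
use strict;
use warnings;
# Assumes perl 5.30 or later; earlier perls do not know this category.
no warnings 'experimental::uniprop_wildcards';

# The \A and \z inside the subpattern restrict it to the script names
# Hebrew and Greek themselves, rather than any name containing them.
my $hebrew_or_greek = qr!\p{scx= /\A(?:Hebrew|Greek)\z/ }!;

# HEBREW LETTER ALEF, GREEK SMALL LETTER ALPHA, LATIN SMALL LETTER A
for my $cp (0x05D0, 0x03B1, 0x0061) {
    printf "U+%04X %s\n", $cp,
        (chr($cp) =~ $hebrew_or_greek ? 'matches' : 'does not match');
}
```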

There are certain properties that it doesn't currently work with. These
are:

Bidi Mirroring Glyph
Bidi Paired Bracket
Case Folding
Decomposition Mapping
Equivalent Unified Ideograph
Name
Name Alias
Lowercase Mapping
NFKC Case Fold
Titlecase Mapping
Uppercase Mapping

Nor is the C<@I<unicode_property>@> form implemented.

Here's a complete example of matching IPv4 internet protocol addresses
in any (single) script:

no warnings 'experimental::script_run';
no warnings 'experimental::regex_sets';
no warnings 'experimental::uniprop_wildcards';

# Can match a substring, so this intermediate regex needs to have
# context or anchoring in its final use. Using nt=de yields decimal
# digits. When specifying a subset of these, we must include \d to
# prevent things like U+00B2 SUPERSCRIPT TWO from matching
my $zero_through_255 =
qr/ \b (*sr: # All from same script
(?[ \p{nv=0} & \d ])* # Optional leading zeros
( # Then one of:
\d{1,2} # 0 - 99
| (?[ \p{nv=1} & \d ]) \d{2} # 100 - 199
| (?[ \p{nv=2} & \d ])
( (?[ \p{nv=:[0-4]:} & \d ]) \d # 200 - 249
| (?[ \p{nv=5} & \d ])
(?[ \p{nv=:[0-5]:} & \d ]) # 250 - 255
)
)
)
\b
/x;

my $ipv4 = qr/ \A (*sr: $zero_through_255
(?: [.] $zero_through_255 ) {3}
)
\z
/x;

=head2 User-Defined Character Properties

@@ -1220,7 +1359,7 @@ C<U+10FFFF> but also beyond C<U+10FFFF>
RL2.3 Default Word Boundaries - Done [11]
RL2.4 Default Case Conversion - Done
RL2.5 Name Properties - Done
-RL2.6 Wildcard Properties - Missing
+RL2.6 Wildcards in Property Values - Partial [12]
RL2.7 Full Properties - Done

=over 4
@@ -1239,6 +1378,9 @@ Perl has C<\X> and C<\b{gcb}> but we don't have a "Grapheme Cluster Mode".
=item [11] see
L<UAX#29 "Unicode Text Segmentation"|http://www.unicode.org/reports/tr29>,

=item [12] see
L</Wildcards in Property Values> above.

=back

=head3 Level 3 - Tailored Support
@@ -1272,7 +1414,7 @@ portion.
Perl has user-defined properties (L</"User-Defined Character
Properties">) to look at single code points in ways beyond Unicode, and
it might be possible, though probably not very clean, to use code blocks
-and things like C<(?(DEFINE)...)> (see L<perlre> to do more specialized
+and things like C<(?(DEFINE)...)> (see L<perlre>) to do more specialized
matching.

=back