Add Unicode property wildcards

Perl · Mar 12, 2019 · 1532347 · 1532347
1 parent 2cd613e
commit 1532347
Show file tree

Hide file tree

Showing 6 changed files with 345 additions and 3 deletions.
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
@@ -27,6 +27,22 @@ here, but most should go in the L</Performance Enhancements> section.
 
 [ List each enhancement as a =head2 entry ]
 
+=head2 Wildcards in Unicode property value specifications are now
+partially supported
+
+You can now do something like this in a regular expression pattern
+
+ qr! \p{nv= /(?x) \A [0-5] \z / }!
+
+which matches all Unicode code points which have numeric value is
+between 0 and 5 inclusive.
+
+This marks another step in implementing the regular expression features
+the Unicode Consortium suggests.
+
+Most properties are supported, with the remainder planned for 5.32.
+Details are in L<perlunicode/Wildcards in Property Values>.
+
 =head2 Unicode 12.0 is supported
 
 For details, see L<https://www.unicode.org/versions/Unicode12.0.0/>.

diff --git a/pod/perldiag.pod b/pod/perldiag.pod
@@ -4244,6 +4244,12 @@ earlier as an attempt to close an unopened filehandle.
 not recognized.  Say C<kill -l> in your shell to see the valid signal
 names on your system.
 
+=item No Unicode property value wildcard matches:
+
+(W regexp) You specified a wildcard for a Unicode property value, but
+there is no property value in the current Unicode release that matches
+it.  Check your spelling.
+
 =item Not a CODE reference
 
 (F) Perl was trying to evaluate a reference to a code value (that is, a
@@ -6172,6 +6178,12 @@ linkhood if the last stat that wrote to the stat buffer already went
 past the symlink to get to the real file.  Use an actual filename
 instead.
 
+=item The Unicode property wildcards feature is experimental
+
+(S experimental::uniprop_wildcards) This feature is experimental
+and its behavior may in any future release of perl.  See
+L<perlunicode/Wildcards in Property Values>.
+
 =item The 'unique' attribute may only be applied to 'our' variables
 
 (F) This attribute was never supported on C<my> or C<sub> declarations.
@@ -6711,6 +6723,14 @@ This is not really a "severe" error, but it is supposed to be
 raised by default even if warnings are not enabled, and currently
 the only way to do that in Perl is to mark it as serious.
 
+=item Unicode property wildcard not terminated
+
+(F) A Unicode property wildcard looks like a delimited regular
+expression pattern (all within the braces of the enclosing C<\p{...}>.
+The closing delimtter to match the opening one was not found.  If the
+opening one is escaped by preceding it with a backslash, the closing one
+must also be so escaped.
+
 =item Unicode surrogate U+%X is illegal in UTF-8
 
 (S surrogate) You had a UTF-16 surrogate in a context where they are

diff --git a/pod/perlre.pod b/pod/perlre.pod
@@ -1019,7 +1019,7 @@ See L<perlrecharclass/POSIX Character Classes> for details.
 
 =item [3]
 
-See L<perlrecharclass/Backslash sequences> for details.
+See L<perlunicode/Unicode Character Properties> for details
 
 =item [4]
 

diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
@@ -405,6 +405,9 @@ non-Unicode code points.  This could be somewhat surprising:
 Even though these two matches might be thought of as complements, until
 v5.20 they were so only on Unicode code points.
 
+Starting in perl v5.30, wildcards are allowed in Unicode property
+values.  See L<perlunicode/Wildcards in Property Values>.
+
 =head4 Examples
 
  "a"  =~  /\w/      # Match, "a" is a 'word' character.

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
@@ -921,6 +921,145 @@ L<perlrecharclass/POSIX Character Classes>.
 
 =back
 
+=head2 Wildcards in Property Values
+
+Starting in Perl 5.30, it is possible to do do something like this:
+
+ qr!\p{numeric_value=/\A[0-5]\z/}!
+
+or, by abbreviating and adding C</x>,
+
+ qr! \p{nv= /(?x) \A [0-5] \z / }!
+
+This matches all code points whose numeric value is one of 0, 1, 2, 3,
+4, or 5.  This particular example could instead have been written as
+
+ qr! \A [ \p{nv=0}\p{nv=1}\p{nv=2}\p{nv=3}\p{nv=4}\p{nv=5} ] \z !xx
+
+in earlier perls, so in this case this feature just makes things easier
+and shorter to write.  If we hadn't included the C<\A> and C<\z>, these
+would have matched things like C<1E<sol>2> because that contains a 1 (as
+well as a 2).  As written, it matches things like subscripts that have
+these numeric values.  If we only wanted the decimal digits with those
+numeric values, we could say,
+
+ qr! (?[ \d & \p{nv=/[0-5]/ ]) }!x
+
+The C<\d> gets rid of needing to anchor the pattern, since it forces the
+result to only match C<[0-9]>, and the C<[0-5]> further restricts it.
+
+The text in the above examples enclosed between the C<"E<sol>">
+characters can be just about any regular expression.  It is independent
+of the main pattern, so doesn't share any capturing groups, I<etc>.  The
+delimiters for it must be ASCII punctuation, but it may NOT be
+delimited by C<"{">, nor C<"}"> nor contain a literal C<"}">, as that
+delimits the end of the enclosing C<\p{}>.  Like any pattern, certain
+other delimiters are terminated by their mirror images.  These are
+C<"(">, C<"[>", and C<"E<lt>">.  If the delimiter is any of C<"-">,
+C<"_">, C<"+">, or C<"\">, or is the same delimiter as is used for the
+enclosing pattern, it must be be preceded by a backslash escape, both
+fore and aft.
+
+Beware of using C<"$"> to indicate to match the end of the string.  It
+can too easily be interpreted as being a punctuation variable, like
+C<$/>.
+
+No modifiers may follow the final delimiter.  Instead, use
+L<perlre/(?adlupimnsx-imnsx)> and/or
+L<perlre/(?adluimnsx-imnsx:pattern)> to specify modifiers.
+
+This feature is not available when the left-hand side is prefixed by
+C<Is_>, nor for any form that is marked as "Discouraged" in
+L<perluniprops/Discouraged>.
+
+Perl wraps your pattern with C<(?iaa: ... )>.  This is because nothing
+outside ASCII can match the Unicode property values available in this
+release, and they should match caselessly.  If your pattern has a syntax
+error, this wrapping will be shown in the error message, even though you
+didn't specify it yourself.  This could be confusing if you don't know
+about this.
+
+This experimental feature has been added to begin to implement
+L<https://www.unicode.org/reports/tr18/#Wildcard_Properties>.  Using it
+will raise a (default-on) warning in the
+C<experimental::uniprop_wildcards> category.  We reserve the right to
+change its operation as we gain experience.
+
+Your subpattern can be just about anything, but for it to have some
+utility, it should match when called with either or both of
+a) the full name of the property value with underscores (and/or spaces
+in the Block property) and some things uppercase; or b) the property
+value in all lowercase with spaces and underscores squeezed out.  For
+example,
+
+ qr!\p{Blk=/Old I.*/}!
+ qr!\p{Blk=/oldi.*/}!
+
+would match the same things.
+
+A warning is issued if none of the legal values for a property are
+matched by your pattern.  It's likely that a future release will raise a
+warning if your pattern ends up causing every possible code point to
+match.
+
+Another example that shows that within C<\p{...}>, C</x> isn't needed to
+have spaces:
+
+ qr!\p{scx= /Hebrew|Greek/ }!
+
+To be safe, we should have anchored the above example, to prevent
+matches for something like C<Hebrew_Braile>, but there aren't
+any script names like that.
+
+There are certain properties that it doesn't currently work with.  These
+are:
+
+ Bidi Mirroring Glyph
+ Bidi Paired Bracket
+ Case Folding
+ Decomposition Mapping
+ Equivalent Unified Ideograph
+ Name
+ Name Alias
+ Lowercase Mapping
+ NFKC Case Fold
+ Titlecase Mapping
+ Uppercase Mapping
+
+Nor is the C<@I<unicode_property>@> form implemented.
+
+Here's a complete example of matching IPV4 internet protocol addresses
+in any (single) script
+
+ no warnings 'experimental::script_run';
+ no warnings 'experimental::regex_sets';
+ no warnings 'experimental::uniprop_wildcards';
+
+ # Can match a substring, so this intermediate regex needs to have
+ # context or anchoring in its final use.  Using nt=de yields decimal
+ # digits.  When specifying a subset of these, we must include \d to
+ # prevent things like U+00B2 SUPERSCRIPT TWO from matching
+ my $zero_through_255 =
+  qr/ \b (*sr:                                  # All from same sript
+            (?[ \p{nv=0} & \d ])*               # Optional leading zeros
+        (                                       # Then one of:
+                                  \d{1,2}       #   0 - 99
+            | (?[ \p{nv=1} & \d ])  \d{2}       #   100 - 199
+            | (?[ \p{nv=2} & \d ])
+               (  (?[ \p{nv=:[0-4]:} & \d ]) \d #   200 - 249
+                | (?[ \p{nv=5}     & \d ])
+                  (?[ \p{nv=:[0-5]:} & \d ])    #   250 - 255
+               )
+        )
+      )
+    \b
+  /x;
+
+ my $ipv4 = qr/ \A (*sr:         $zero_through_255
+                         (?: [.] $zero_through_255 ) {3}
+                   )
+                \z
+            /x;
 
 =head2 User-Defined Character Properties
 
@@ -1220,7 +1359,7 @@ C<U+10FFFF> but also beyond C<U+10FFFF>
  RL2.3   Default Word Boundaries         - Done          [11]
  RL2.4   Default Case Conversion         - Done
  RL2.5   Name Properties                 - Done
- RL2.6   Wildcard Properties             - Missing
+ RL2.6   Wildcards in Property Values    - Partial       [12]
  RL2.7   Full Properties                 - Done
 
 =over 4
@@ -1239,6 +1378,9 @@ Perl has C<\X> and C<\b{gcb}> but we don't have a "Grapheme Cluster Mode".
 =item [11] see
 L<UAX#29 "Unicode Text Segmentation"|http://www.unicode.org/reports/tr29>,
 
+=item [12] see
+L</Wildcards in Property Values> above.
+
 =back
 
 =head3 Level 3 - Tailored Support
@@ -1272,7 +1414,7 @@ portion.
 Perl has user-defined properties (L</"User-Defined Character
 Properties">) to look at single code points in ways beyond Unicode, and
 it might be possible, though probably not very clean, to use code blocks
-and things like C<(?(DEFINE)...)> (see L<perlre> to do more specialized
+and things like C<(?(DEFINE)...)> (see L<perlre>) to do more specialized
 matching.
 
 =back