Commit

Reword discussion of /d regexp modifier.

The phrasing as it stood confused UTF8-flagged strings with
“UTF-8 encoded”. The latter term should refer to strings that the
Perl application has actually encode()d, which probably *won’t*
be UTF8-flagged and thus won’t, per /d modifier rules, get the Unicode
treatment.
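
The distinction above can be checked directly. The following is a minimal sketch, assuming an ASCII platform and Perl 5.14 or later (for the explicit `/d` modifier); the variable names are illustrative only, and `utf8::is_utf8()` is used purely to peek at the internal flag.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A one-character string holding U+00DF (LATIN SMALL LETTER SHARP S).
my $chars = "\xDF";

# Explicitly encode it. The result is the two-byte string "\xC3\x9F".
my $bytes = encode('UTF-8', $chars);

# Peek at the internal UTF8 flag of the encoded bytes (illustration only).
my $bytes_flagged = utf8::is_utf8($bytes) ? 1 : 0;

# Under /d, an unflagged string gets native rules, so on an ASCII
# platform neither byte counts as a word character.
my $bytes_match = ($bytes =~ /\w/d) ? 1 : 0;

# The same character in a UTF8-flagged string gets Unicode rules.
my $upgraded = $chars;
utf8::upgrade($upgraded);
my $upgraded_match = ($upgraded =~ /\w/d) ? 1 : 0;

print "encoded bytes flagged: $bytes_flagged\n";
print "encoded bytes match \\w under /d: $bytes_match\n";
print "upgraded string matches \\w under /d: $upgraded_match\n";
```

So the "UTF-8 encoded" string is the one that does not get Unicode treatment under `/d`, which is the confusion the old wording invited.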

This also removes an incorrect statement that only ASCII characters can
match when the UTF8 flag is absent. This is trivially false
given that "\xff" =~ /\xff/ is truthy.
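
A short sketch of that counterexample, assuming an ASCII platform and Perl 5.14 or later; the variable names are illustrative. It shows a non-ASCII character matching literally without the UTF8 flag, while the same character still fails `\w` under native rules.

```perl
use strict;
use warnings;

my $str = "\xff";
utf8::downgrade($str);    # make sure the string is not UTF8-flagged

# A literal non-ASCII character in the pattern matches itself.
my $literal_match = ($str =~ /\xff/d) ? 1 : 0;

# Character-class semantics are what /d changes: without the flag,
# 0xFF is not a word character on an ASCII platform.
my $word_match = ($str =~ /\w/d) ? 1 : 0;

print "literal match: $literal_match\n";
print "\\w match: $word_match\n";
```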

This also reorders and rewords some parts in an attempt to clarify that
new code should avoid this flag, including use of the 'unicode_strings'
feature to avoid implicit use.
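
As a hedged illustration of that advice, the sketch below compares the implicit default with the `unicode_strings` feature. It assumes an ASCII platform and Perl 5.14 or later; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "\xDF";
utf8::downgrade($str);    # not UTF8-flagged

# No modifier and no feature: /d is selected implicitly, so native
# rules apply and 0xDF is not a word character.
my $default_match = ($str =~ /^\w/) ? 1 : 0;

my $feature_match;
{
    # Within this lexical scope, patterns default to /u instead of /d.
    use feature 'unicode_strings';
    $feature_match = ($str =~ /^\w/) ? 1 : 0;
}

print "implicit /d: $default_match\n";
print "with unicode_strings: $feature_match\n";
```

With the feature enabled, the result no longer depends on the string's internal representation.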
FGasper committed Aug 27, 2021
1 parent 3bbdeca commit 5fbaad5
Showing 1 changed file with 31 additions and 20 deletions.
pod/perlre.pod
@@ -678,18 +678,27 @@ X</u>

=head4 /d

+B<IMPORTANT:> Because of the unpredictable behaviors this
+modifier causes, only use it to maintain weird backward compatibilities.
+Use the
+L<< C<unicode_strings>|feature/"The 'unicode_strings' feature" >>
+feature
+in new code to avoid inadvertently enabling this modifier by default.
+
This modifier means to use the "Default" native rules of the platform
except when there is cause to use Unicode rules instead, as follows:

=over 4

=item 1

-the target string is encoded in UTF-8; or
+the target string's L<UTF8 flag|perlunifaq/What is "the UTF8 flag"?>
+(see below) is set; or

=item 2

-the pattern is encoded in UTF-8; or
+the pattern's L<UTF8 flag|perlunifaq/What is "the UTF8 flag"?>
+(see below) is set; or

=item 3

@@ -718,30 +727,32 @@ the pattern uses L<C<(*script_run: ...)>|/Script Runs>

=back

-Another mnemonic for this modifier is "Depends", as the rules actually
-used depend on various things, and as a result you can get unexpected
-results. See L<perlunicode/The "Unicode Bug">. The Unicode Bug has
-become rather infamous, leading to yet other (without swearing) names
-for this modifier, "Dicey" and "Dodgy".
-
-Unless the pattern or string are encoded in UTF-8, only ASCII characters
-can match positively.
+Regarding the "UTF8 flag" references above: Another mnemonic for this
+modifier is "Depends". This is because that UTF8 flag isn't something
+Perl applications should think about; it's part of Perl's internals,
+so it can change whenever Perl wants. C</d> may thus cause unpredictable
+results. See L<perlunicode/The "Unicode Bug">. This bug
+has become rather infamous, leading to yet other (without swearing) names
+for this modifier like "Dicey" and "Dodgy".

Here are some examples of how that works on an ASCII platform:

-    $str = "\xDF";      # $str is not in UTF-8 format.
-    $str =~ /^\w/;      # No match, as $str isn't in UTF-8 format.
-    $str .= "\x{0e0b}"; # Now $str is in UTF-8 format.
-    $str =~ /^\w/;      # Match! $str is now in UTF-8 format.
+    $str = "\xDF";          #
+    utf8::downgrade($str);  # $str is not UTF8-flagged.
+    $str =~ /^\w/;          # No match, since no UTF8 flag.
+
+    $str .= "\x{0e0b}";     # Now $str is UTF8-flagged.
+    $str =~ /^\w/;          # Match! $str is now UTF8-flagged.
     chop $str;
-    $str =~ /^\w/;      # Still a match! $str remains in UTF-8 format.
+    $str =~ /^\w/;          # Still a match! $str retains its UTF8 flag.

-This modifier is automatically selected by default when none of the
-others are, so yet another name for it is "Default".
+Under Perl's default configuration this modifier is automatically
+selected by default when none of the others are, so yet another name
+for it (unfortunately) is "Default".

-Because of the unexpected behaviors associated with this modifier, you
-probably should only explicitly use it to maintain weird backward
-compatibilities.
+Whenever you can, use the
+L<< C<unicode_strings>|feature/"The 'unicode_strings' feature" >>
+to cause X</u> to be the default instead.

=head4 /a (and /aa)
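
The revised example in this hunk can be run as a standalone script. This is a minimal sketch, assuming an ASCII platform and Perl 5.14 or later; the `@results` and `$same_chars` names are illustrative and not part of the patch. It records the three match results and confirms that the final string holds the same characters as the initial one, so only the internal representation differs.

```perl
use strict;
use warnings;

my @results;

my $str = "\xDF";
utf8::downgrade($str);                      # not UTF8-flagged
push @results, ($str =~ /^\w/d) ? 1 : 0;    # native rules apply

$str .= "\x{0e0b}";                         # a code point above 0xFF sets the flag
push @results, ($str =~ /^\w/d) ? 1 : 0;    # Unicode rules apply

chop $str;                                  # back to the single character 0xDF
push @results, ($str =~ /^\w/d) ? 1 : 0;    # flag is retained, so still Unicode rules

# Same characters as at the start; only the internal flag changed.
my $same_chars = ($str eq "\xDF") ? 1 : 0;

print "matches: @results\n";
print "same characters as the original: $same_chars\n";
```

The first and last matches operate on equal strings yet give different answers, which is the behavior the "Depends" mnemonic warns about.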
