Skip to content

Commit

Permalink
regcomp.c - Exclude split STRING from /x default modifiers.
Browse files Browse the repository at this point in the history
We have a magic special case with split STRING where if the STRING
is a single space character we trigger awk emaulation mode, which
causes split to ignore leading whitespace, and to split on \s+.

When default regex modifier support was added to perl we didn't notice
that it doesn't make sense for the /x modifiers to change the behavior
of a STRING pattern in split. STRING patterns are NOT parsed using the
normal pattern parsing rules, for instance "\s+" would split on 's'
characters, not on whitepace. The /x modifier is designed to change the
rules for *pattern* parsing, most notably by adding support for comments
and ignoring whitespace in pattern definitions. This feels to me like a
feature that should only apply to cases where a true pattern is parsed,
such as where \s+ would be treated as the special pattern for matching
whitespace and not a literal 's' character. Historically there was NO way
to create a string pattern with the /x pattern enabled, to enable /x the
string MUST have been fed into a true pattern parsing construct like
qr// m// or s///.

By excluding STRING patterns from /x notation, we ensure that the
special case of awk emulation for C<split " "> is preserved.

Note that a default /x modifier *does* change the behavior of
C<split / />, as the pattern in that case is parsed the same as
C<split qr/ /> or C<m/ />.

With changes by Yves Orton

Fixes #18032
  • Loading branch information
richardleach authored and demerphq committed Mar 13, 2024
1 parent 4d119e9 commit 216a5aa
Show file tree
Hide file tree
Showing 5 changed files with 56 additions and 3 deletions.
7 changes: 6 additions & 1 deletion ext/re/re.pm
Expand Up @@ -4,7 +4,7 @@ package re;
use strict;
use warnings;

our $VERSION = "0.45";
our $VERSION = "0.46";
our @ISA = qw(Exporter);
our @EXPORT_OK = qw{
is_regexp regexp_pattern
Expand Down Expand Up @@ -500,6 +500,11 @@ example:
use re "/l";
no re "/l"; # reverts to unicode_strings behaviour
Default flags are applied to wherever a pattern is compiled with the exception
of the C</x> flag, which is not applied to patterns compiled from string arguments
to C<split>. Thus `use re "/x";` does not affect the behaviour of C<split " "> but
B<does> affect the behavior of C<split / />.
=head2 'debug' mode
When C<use re 'debug'> is in effect, perl emits debugging messages when
Expand Down
7 changes: 7 additions & 0 deletions pod/perlfunc.pod
Expand Up @@ -8317,6 +8317,13 @@ whitespace according to Unicode rules but not according to ASCII rules can be
treated as part of fields rather than as field separators, depending on the
string's internal encoding.

As of Perl 5.39.9 the C</x> default modifier does NOT affect
C<split STRING> but does affect C<split PATTERN>, this means that
C<split " "> will produces the expected I<awk> emulation regardless as

This comment has been minimized.

Copy link
@druud

druud Mar 13, 2024

Contributor

"will produce"

to whether it is used in the scope of a C<use re "/x"> statement. If you
want to split by spaces under C<use re "/x"> you must do something like
C<split /(?-x: )/> or C<split /\x{20}/> instead of C<split / />.

If omitted, PATTERN defaults to a single space, S<C<" ">>, triggering
the previously described I<awk> emulation.

Expand Down
3 changes: 2 additions & 1 deletion pod/perlre.pod
Expand Up @@ -844,7 +844,8 @@ described in the remainder of this section.
The C<L<use re 'E<sol>foo'|re/"'/flags' mode">> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope. This pragma has precedence over the other pragmas
listed below that also change the defaults.
listed below that also change the defaults. Note that the /x modifier does
NOT affect C<split STR> patterns.

This comment has been minimized.

Copy link
@druud

druud Mar 13, 2024

Contributor

"STR"?
"split STRING" is more used elsewhere.


Otherwise, C<L<use locale|perllocale>> sets the default modifier to C</l>;
and C<L<use feature 'unicode_strings|feature>>, or
Expand Down
2 changes: 2 additions & 0 deletions regcomp.c
Expand Up @@ -1496,6 +1496,8 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
FAIL("Regexp out of space");

rx_flags = orig_rx_flags;
if (rx_flags & RXf_SPLIT)
rx_flags &= ~(RXf_PMf_EXTENDED|RXf_PMf_EXTENDED_MORE);

if ( toUSE_UNI_CHARSET_NOT_DEPENDS
&& initial_charset == REGEX_DEPENDS_CHARSET)
Expand Down
40 changes: 39 additions & 1 deletion t/op/split.t
Expand Up @@ -7,7 +7,7 @@ BEGIN {
require './charset_tools.pl';
}

plan tests => 197;
plan tests => 219;

$FS = ':';

Expand Down Expand Up @@ -725,3 +725,41 @@ SKIP: {
}, "special-case pattern for $prog");
}
}

# gh18032: check that `split " "` does not get converted to `split ""`
SKIP: {
my @skipwhite= ('split " "', 'split "\x20"', 'split "\N{SPACE}"',
'split "$e$sp$e"', 'split');
my @noskipwhite= (
'split / /', 'split m/ /', 'split qr/ /',
'split /$e$sp$e/', 'split m/$e$sp$e/', 'split qr/$e$sp$e/'
);
skip_if_miniperl("special-case patterns: need dynamic loading",
2*(@skipwhite+@noskipwhite));

my $modifiers = "x"; # the original bug report used /aansx

for my $prog ( @skipwhite ) {
fresh_perl_like("use re qw(/$modifiers); \$sp=qq( ); \$e=qq(); $prog;",
qr{^r->extflags:.*\bSKIPWHITE\b\s\bWHITE\b}m,
{switches => [ '-Mre=Debug,COMPILE' ]},
"$prog sets SKIPWHITE|WHITE under `use re qw(/$modifiers)`");

fresh_perl_like("use re qw(/$modifiers); \$sp=qq( ); \$e=qq();"
."\$_=qq( 1 1 ); \@c=$prog; print 0+\@c, qq(<\@c>)",
qr{^2<1 1>}m,
{},
"$prog matches as expected `use re qw(/$modifiers)`");
}
for my $prog ( @noskipwhite) {
fresh_perl_like("use re qw(/$modifiers); \$sp=qq( ); \$e=qq(); $prog;",
qr{^r->extflags:.*\bNULL\b}m,
{switches => [ '-Mre=Debug,COMPILE' ]},
"$prog does not set SKIPWHITE|WHITE under `use re qw(/$modifiers)`");
fresh_perl_like("use re qw(/$modifiers); \$sp=qq( ); \$e=qq();"
."\$_=qq( 1 1 ); \@c=$prog; print 0+\@c, qq(<\@c>)",
qr{^6< 1 1 >}m,
{},
"$prog matches expected under `use re qw(/$modifiers)`");
}
}

0 comments on commit 216a5aa

Please sign in to comment.