RT #130907: Fix the Unicode Bug in split " "
arc authored and jkeenan committed Mar 4, 2017
1 parent 83cad69 commit 3545146
Showing 8 changed files with 71 additions and 9 deletions.
7 changes: 4 additions & 3 deletions lib/feature.pm
@@ -5,7 +5,7 @@

package feature;

-our $VERSION = '1.47';
+our $VERSION = '1.48';

our %feature = (
fc => 'feature_fc',
@@ -175,8 +175,9 @@ C<use feature 'unicode_strings'> subpragma is B<strongly> recommended.
This feature is available starting with Perl 5.12; was almost fully
implemented in Perl 5.14; and extended in Perl 5.16 to cover C<quotemeta>;
-and extended further in Perl 5.26 to cover L<the range
-operator|perlop/Range Operators>.
+was extended further in Perl 5.26 to cover L<the range
+operator|perlop/Range Operators>; and was extended again in Perl 5.28 to
+cover L<special-cased whitespace splitting|perlfunc/split>.
=head2 The 'unicode_eval' and 'evalbytes' features
9 changes: 9 additions & 0 deletions pod/perldelta.pod
@@ -343,6 +343,15 @@ expression had no named captures. The same applies to access to any
hash tied with L<Tie::Hash::NamedCapture> and C<< all => 1 >>. [perl
#130822]

=item *

C<split ' '> now handles the argument being split correctly when in the
scope of the L<< C<unicode_strings>|feature/"The 'unicode_strings' feature"
>> feature. Previously, when a string using the single-byte internal
representation contained characters that are whitespace by Unicode rules but
not by ASCII rules, it treated those characters as part of fields rather
than as field separators. This resolves [perl #130907].

=back

=head1 Known Problems
8 changes: 8 additions & 0 deletions pod/perlfunc.pod
@@ -7601,6 +7601,14 @@ special case was restricted to the use of a plain S<C<" ">> as the
pattern argument to split; in Perl 5.18.0 and later this special case is
triggered by any expression which evaluates to the simple string S<C<" ">>.

As of Perl 5.28, this special-cased whitespace splitting works as expected in
the scope of L<< S<C<"use feature 'unicode_strings'">>|feature/The
'unicode_strings' feature >>. In previous versions, and outside the scope of
that feature, it exhibits L<perlunicode/The "Unicode Bug">: characters that are
whitespace according to Unicode rules but not according to ASCII rules can be
treated as part of fields rather than as field separators, depending on the
string's internal encoding.

If omitted, PATTERN defaults to a single space, S<C<" ">>, triggering
the previously described I<awk> emulation.

11 changes: 11 additions & 0 deletions pod/perlunicode.pod
@@ -1824,6 +1824,17 @@ outside its scope, it could produce strings whose length in characters
exceeded that of the right-hand side, where the right-hand side took up more
bytes than the correct range endpoint.

=item *

In L<< C<split>'s special-case whitespace splitting|perlfunc/split >>.

Starting in Perl 5.28.0, the C<split> function with a pattern specified as
a string containing a single space handles whitespace characters consistently
within the scope of C<unicode_strings>. Prior to that, or outside its scope,
characters that are whitespace according to Unicode rules but not according to
ASCII rules were treated as field contents rather than field separators when
they appeared in byte-encoded strings.

=back

You can see from the above that the effect of C<unicode_strings>
5 changes: 3 additions & 2 deletions pod/perluniintro.pod
@@ -151,11 +151,12 @@ serious Unicode work. The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
Perl v5.14.0 is the first release where Unicode support is
-(almost) seamlessly integrable without some gotchas. (There are two
+(almost) seamlessly integrable without some gotchas. (There are a few
exceptions. Firstly, some differences in L<quotemeta|perlfunc/quotemeta>
were fixed starting in Perl 5.16.0. Secondly, some differences in
L<the range operator|perlop/Range Operators> were fixed starting in
-Perl 5.26.0.)
+Perl 5.26.0. Thirdly, some differences in L<split|perlfunc/split> were fixed
+starting in Perl 5.28.0.)

To enable this
seamless support, you should C<use feature 'unicode_strings'> (which is
13 changes: 13 additions & 0 deletions pp.c
@@ -5716,6 +5716,7 @@ PP(pp_split)
STRLEN len;
const char *s = SvPV_const(sv, len);
const bool do_utf8 = DO_UTF8(sv);
const bool in_uni_8_bit = IN_UNI_8_BIT;
const char *strend = s + len;
PMOP *pm = cPMOPx(PL_op);
REGEXP *rx;
@@ -5802,6 +5803,10 @@ PP(pp_split)
while (s < strend && isSPACE_LC(*s))
s++;
}
else if (in_uni_8_bit) {
while (s < strend && isSPACE_L1(*s))
s++;
}
else {
while (s < strend && isSPACE(*s))
s++;
@@ -5833,6 +5838,10 @@ PP(pp_split)
{
while (m < strend && !isSPACE_LC(*m))
++m;
}
else if (in_uni_8_bit) {
while (m < strend && !isSPACE_L1(*m))
++m;
} else {
while (m < strend && !isSPACE(*m))
++m;
@@ -5867,6 +5876,10 @@ PP(pp_split)
{
while (s < strend && isSPACE_LC(*s))
++s;
}
else if (in_uni_8_bit) {
while (s < strend && isSPACE_L1(*s))
++s;
} else {
while (s < strend && isSPACE(*s))
++s;
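As a rough illustration of the pp.c change above, the following standalone C sketch mimics how the awk-emulation loops pick a whitespace test. It is not the Perl source: is_space_ascii, is_space_latin1 and count_fields are hypothetical names standing in for isSPACE, isSPACE_L1 and the scanning loops in pp_split. The locale (isSPACE_LC) and UTF-8 branches are left out, and the sketch assumes that the Latin-1 rule adds only NEL (0x85) and NBSP (0xA0) to the ASCII whitespace set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for isSPACE(): ASCII whitespace only
 * (space, plus the contiguous range \t \n \v \f \r). */
static bool is_space_ascii(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Hypothetical stand-in for isSPACE_L1(): ASCII whitespace plus the two
 * upper-Latin-1 characters assumed to be whitespace under Unicode rules. */
static bool is_space_latin1(unsigned char c)
{
    return is_space_ascii(c) || c == 0x85 || c == 0xA0;
}

/* Count the fields an awk-style split would produce for a byte string.
 * in_uni_8_bit plays the role of the new flag in pp_split: when true,
 * the Latin-1 test is used instead of the ASCII-only one. */
size_t count_fields(const char *s, size_t len, bool in_uni_8_bit)
{
    assert(s != NULL);

    bool (*is_space)(unsigned char) =
        in_uni_8_bit ? is_space_latin1 : is_space_ascii;
    const char *strend = s + len;
    size_t fields = 0;

    /* Skip leading whitespace, as the first patched loop does. */
    while (s < strend && is_space((unsigned char)*s))
        s++;

    while (s < strend) {
        /* Scan to the end of the current field. */
        while (s < strend && !is_space((unsigned char)*s))
            s++;
        fields++;

        /* Skip the run of separators that follows the field. */
        while (s < strend && is_space((unsigned char)*s))
            s++;
    }
    return fields;
}
```

For the four bytes '.', 0xA0, 0xA0, '/', count_fields reports one field when in_uni_8_bit is false and two when it is true, which is the difference the new tests in t/op/split.t check for.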
7 changes: 4 additions & 3 deletions regen/feature.pl
@@ -367,7 +367,7 @@ sub longest {
__END__
package feature;
-our $VERSION = '1.47';
+our $VERSION = '1.48';
FEATURES
@@ -485,8 +485,9 @@ =head2 The 'unicode_strings' feature
This feature is available starting with Perl 5.12; was almost fully
implemented in Perl 5.14; and extended in Perl 5.16 to cover C<quotemeta>;
-and extended further in Perl 5.26 to cover L<the range
-operator|perlop/Range Operators>.
+was extended further in Perl 5.26 to cover L<the range
+operator|perlop/Range Operators>; and was extended again in Perl 5.28 to
+cover L<special-cased whitespace splitting|perlfunc/split>.
=head2 The 'unicode_eval' and 'evalbytes' features
20 changes: 19 additions & 1 deletion t/op/split.t
@@ -7,7 +7,7 @@ BEGIN {
set_up_inc('../lib');
}

-plan tests => 163;
+plan tests => 172;

$FS = ':';

@@ -480,6 +480,24 @@ is($cnt, scalar(@ary));
qq{split(\$cond ? qr/ / : " ", "$exp") behaves as expected over repeated similar patterns};
}

SKIP: {
# RT #130907: unicode_strings feature doesn't work with split ' '

my ($sp) = grep /\s/u, map chr, reverse 128 .. 255 # prefer \xA0 over \x85
or skip 'no unicode whitespace found in high-8-bit range', 9;

for (["$sp$sp. /", "leading unicode whitespace"],
[".$sp$sp/", "unicode whitespace separator"],
[". /$sp$sp", "trailing unicode whitespace"]) {
my ($str, $desc) = @$_;
use feature "unicode_strings";
my @got = split " ", $str;
is @got, 2, "whitespace split: $desc: field count";
is $got[0], '.', "whitespace split: $desc: field 0";
is $got[1], '/', "whitespace split: $desc: field 1";
}
}

{
# 'RT #116086: split "\x20" does not work as documented';
my @results;