erroneous regex warning after utf8 conversion #15958

p5pRT · 2017-04-21T23:14:33Z

Migrated from rt.perl.org#131190 (status was 'resolved')

Searchable as RT131190$

p5pRT · 2017-04-21T23:14:33Z

From saint.snit@gmail.com

This is a bug report for perl from saint.snit@gmail.com,
generated with the help of perlbug 1.40 running under perl 5.22.2.

Perl emits an inapplicable warning for some regular expressions.
Here is a code block that demonstrates the bug:

use experimental 'smartmatch';
use POSIX 'locale_h';
use locale ':ctype';
setlocale(LC_CTYPE, 'en_US');

$_ = "x";
utf8::upgrade($_);
/x(y|z)?/;

This produces the output:

Wide character (U+FFFD) in pattern match (m//) at - line 8.

even though no such character is ever specified.

The bug only seems to occur with the particular combination of "use"
statements above; for instance, it does not occur if the "smartmatch" line
is omitted, even though the code uses no smart matching.

(For reference, discussion about this bug originated at
http://perlmonks.com/?node_id=1188156)

Thank you for your time.

Flags:
category=core
severity=low

Site configuration information for perl 5.22.2:

Configured by Gentoo at Thu Oct 20 22:32:43 CDT 2016.

Summary of my perl5 (revision 5 version 22 subversion 2) configuration:

Platform:
osname=linux, osvers=4.4.21-gentoo-evo, archname=i686-linux
uname='linux evo 4.4.21-gentoo-evo #1 smp sun oct 9 21:43:27 cdt
2016 i686 intel(r) pentium(r) 4 cpu 1.70ghz genuineintel gnulinux '
config_args='-des -Duseshrplib -Darchname=i686-linux
-Dcc=i686-pc-linux-gnu-gcc -Doptimize=-O2 -march=native -pipe
-Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr
-Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin
-Dprivlib=/usr/lib/perl5/5.22.2 -Darchlib=/usr/lib/perl5/5.22.2/i686-linux
-Dsitelib=/usr/local/lib/perl5/5.22.2
-Dsitearch=/usr/local/lib/perl5/5.22.2/i686-linux
-Dvendorlib=/usr/lib/perl5/vendor_perl/5.22.2
-Dvendorarch=/usr/lib/perl5/vendor_perl/5.22.2/i686-linux
-Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3
-Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3
-Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3
-Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.22.2
-Dlocincpth=/usr/include -Dglibpth=/lib /usr/lib -Duselargefiles
-Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost
-Dperladmin=root@localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm
-Di_gdbm -Di_db -DDEBUGGING=none -Dinc_version_list=5.22.0/i686-linux
5.22.0 5.22.1/i686-linux 5.22.1 -Dnoextensions=ODBM_File'
hint=recommended, useposix=true, d_sigaction=define
useithreads=undef, usemultiplicity=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='i686-pc-linux-gnu-gcc', ccflags ='-fwrapv -fno-strict-aliasing
-pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-O2 -march=native -pipe',
cppflags='-fwrapv -fno-strict-aliasing -pipe'
ccversion='', gccversion='4.9.3', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234,
doublekind=3
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12,
longdblkind=3
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='i686-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
libpth=/usr/lib/gcc/i686-pc-linux-gnu/4.9.3/include-fixed /usr/lib
/lib/../lib /usr/lib/../lib /lib
libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
-lgdbm_compat
perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
libc=libc-2.22.so, so=so, useshrplib=true, libperl=libperl.so.5.22.2
gnulibc_version='2.22'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fPIC', lddlflags='-shared -O2 -march=native -pipe
-Wl,-O1 -Wl,--as-needed'

Locally applied patches:
gentoo/hints_hpux - Fix hpux hints
gentoo/aix_soname - aix gcc detection and shared library soname
support
gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054
cpan/ExtUtils-MakeMaker: drop $PORTAGE_TMPDIR from LD_RUN_PATH
gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags
gentoo/opensolaris_headers - [PATCH] Add headers for opensolaris
gentoo/patchlevel - List packaged patches for perl-5.22.2(#1)
in patchlevel.h
gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site
directories by default.
gentoo/cleanup-paths - [PATCH] Cleanup PATH and shrpenv
gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing
@INC directories.
gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 [PATCH] darwin:
Use $CC to link
gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default
for modules installed from CPAN.
gentoo/interix - [PATCH] Fix interix hints
gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 [PATCH]
Set libperl soname
gentoo/mod_paths - Add /etc/perl to @INC
gentoo/EUMM_perllocalpod -
gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 [PATCH]
Don't force -fstack-protector on everyone
gentoo/usr_local - [PATCH] Remove /usr/local paths
gentoo/D-SHA-CFLAGS - https://bugs.gentoo.org/506818 [PATCH] [PATCH]
Do not set custom CFLAGS in cpan/Digest-SHA
gentoo/io_socket_ip_tests -
gentoo/tests.patch -
debian/cpan-missing-site-dirs - Fix CPAN::FirstTime defaults with
nonexisting site dirs if a parent is writable
debian/makemaker-pasthru - Pass LD settings through to subdirectories
debian/locale-robustness - [perl #124310] Make t/run/locale.t survive
missing locales masked by LC_ALL
fixes/memoize_storable_nstore - [rt.cpan.org #77790]
Memoize::Storable: respect 'nstore' option not respected
fixes/podman-pipe - Better errors for man pages from standard input
fixes/respect_umask - Respect umask during installation
fixes/podman-utc - Make the embedded date from Pod::Man reproducible
fixes/podman-utc-docs - Documentation and test suite updates for
UTC fix
fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net::SMTP
'Port' option
fixes/document_makemaker_ccflags - [rt.cpan.org #68613] Document
that CCFLAGS should include $Config{ccflags}

@INC for perl 5.22.2:
/etc/perl
/usr/local/lib/perl5/5.22.2/i686-linux
/usr/local/lib/perl5/5.22.2
/usr/lib/perl5/vendor_perl/5.22.2/i686-linux
/usr/lib/perl5/vendor_perl/5.22.2
/usr/local/lib/perl5
/usr/lib/perl5/vendor_perl
/usr/lib/perl5/5.22.2/i686-linux
/usr/lib/perl5/5.22.2
.

Environment for perl 5.22.2:
HOME=/home/vax
LANG=C
LANGUAGE (unset)
LC_CTYPE=en_US.iso88591
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=/home/vax/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.9.3:/sbin:/usr/sbin:/usr/games/bin:./bin
PERL_BADLANG (unset)
SHELL=/bin/bash

p5pRT · 2017-04-23T21:32:54Z

From zefram@fysh.org

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to smartmatch,
but simply for its effect on lexical warning flags. Anything touching
lexical warnings will do, such as the simpler "use warnings". And thus
enabling all warnings produces an additional warning that sheds some
light on the matter:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at ../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.
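
The chain described above (empty input, then replacement character, then wide-character warning) can be sketched in isolation. This is a toy model, not perl's decoder: `toy_decode`, `TOY_ALLOW_EMPTY`, and `is_wide` are hypothetical names, with the flag only loosely modelled on UTF8_ALLOW_EMPTY.

```c
#include <stddef.h>

#define REPLACEMENT_CHAR 0xFFFDUL
#define TOY_ALLOW_EMPTY  0x1u  /* hypothetical flag, modelled on UTF8_ALLOW_EMPTY */

/* Toy decoder (hypothetical, not perl's): handles only empty input,
 * ASCII, and two-byte sequences; anything else yields U+FFFD. */
static unsigned long toy_decode(const unsigned char *s, size_t len, unsigned flags)
{
    if (len == 0)
        return (flags & TOY_ALLOW_EMPTY) ? 0UL : REPLACEMENT_CHAR;
    if (s[0] < 0x80)
        return s[0];
    if ((s[0] & 0xE0) == 0xC0 && len >= 2 && (s[1] & 0xC0) == 0x80)
        return ((unsigned long)(s[0] & 0x1F) << 6) | (unsigned long)(s[1] & 0x3F);
    return REPLACEMENT_CHAR;
}

/* A code point is "wide" for a single-byte locale when it exceeds 0xFF. */
static int is_wide(unsigned long cp)
{
    return cp > 0xFF;
}
```

In this sketch, decoding zero bytes without the flag yields U+FFFD, which `is_wide` then reports as wide. With the flag set, the same call returns 0 and no wide character is seen.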

-zefram

p5pRT · 2017-04-23T21:32:54Z

The RT System itself - Status changed from 'new' to 'open'

p5pRT · 2017-04-24T04:20:04Z

From @khwilliamson

On 04/23/2017 01:33 PM, Zefram wrote:

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to smartmatch,
but simply for its effect on lexical warning flags. Anything touching
lexical warnings will do, such as the simpler "use warnings". And thus
enabling all warnings produces an additional warning that sheds some
light on the matter:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at ../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.

Actually, the decode function shouldn't be getting called at all if
there is nothing to decode.

-zefram

p5pRT · 2017-04-24T07:21:23Z

From @demerphq

On 24 April 2017 at 06:19, Karl Williamson <public@khwilliamson.com> wrote:

On 04/23/2017 01:33 PM, Zefram wrote:

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to smartmatch,
but simply for its effect on lexical warning flags. Anything touching
lexical warnings will do, such as the simpler "use warnings". And thus
enabling all warnings produces an additional warning that sheds some
light on the matter:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at
../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.

Actually, the decode function shouldn't be getting called at all if there is
nothing to decode.

I pushed a fix for this yesterday.

Dave M changed the var "nextchr" so that it supported a negative
argument beyond its normal duties as holding the codepoint of the
nextchar in the regex matching. This negative argument, -10, is used
to represent the end of string. (All very reasonable.)

Unfortunately this doesn't play entirely nicely with the utf8 code
which for other generally good reasons tends to cast its arguments to
U8 (to deal with "char" inputs). So the code:

if (utf8_target
&& UTF8_IS_ABOVE_LATIN1(nextchr)

sees the -10, casts it to a U8, producing 246 (0xF6), and then considers it
to be the first octet of a (presumably malformed) UTF-8 sequence.

The fix was to change this to

if (utf8_target
&& nextchr >= 0
&& UTF8_IS_ABOVE_LATIN1(nextchr)
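
The effect of the cast and of the added guard can be reproduced outside perl. The following is a minimal sketch under stated assumptions: `is_above_latin1` is a simplified stand-in for UTF8_IS_ABOVE_LATIN1 on ASCII platforms, `NEXTCHR_EOS` simply names the -10 sentinel mentioned above, and `check_unguarded` / `check_guarded` are hypothetical helpers mirroring the two conditions.

```c
typedef unsigned char U8;

#define NEXTCHR_EOS (-10)  /* end-of-string sentinel, per the -10 above */

/* Simplified stand-in for UTF8_IS_ABOVE_LATIN1 on ASCII platforms:
 * lead bytes 0xC4 and up start code points above U+00FF. */
static int is_above_latin1(int c)
{
    return (U8)c >= 0xC4;
}

/* The condition before the fix: the sentinel reaches the byte test. */
static int check_unguarded(int utf8_target, int nextchr)
{
    return utf8_target && is_above_latin1(nextchr);
}

/* The condition after the fix: negative values are rejected first. */
static int check_guarded(int utf8_target, int nextchr)
{
    return utf8_target && nextchr >= 0 && is_above_latin1(nextchr);
}
```

With 8-bit bytes, `(U8)NEXTCHR_EOS` is 246 (0xF6), so `check_unguarded(1, NEXTCHR_EOS)` is true while `check_guarded(1, NEXTCHR_EOS)` is false.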

I was going to push this to a branch etc, but I managed to make a big
mess of things yesterday and pushed it to trunk along with a bunch of
other junk stuff which was meant for after the code freeze.

After running out of time cleaning up the stuff that definitely
shouldn't be part of this next release, and being somewhat unwilling to
allow a perl to be released with such an obvious bug and obviously
safe bugfix unapplied, I decided to leave it for today to sort out.

So this issue is fixed and understood, but the exact status of the
patch is a bit up in the air. If sawyer says to revert it then I will,
but I feel this patch is safe enough to be part of the release.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2017-04-26T16:17:55Z

From @khwilliamson

On 04/24/2017 01:20 AM, demerphq wrote:

On 24 April 2017 at 06:19, Karl Williamson <public@khwilliamson.com> wrote:

On 04/23/2017 01:33 PM, Zefram wrote:

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to smartmatch,
but simply for its effect on lexical warning flags. Anything touching
lexical warnings will do, such as the simpler "use warnings". And thus
enabling all warnings produces an additional warning that sheds some
light on the matter:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at
../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.

Actually, the decode function shouldn't be getting called at all if there is
nothing to decode.

I pushed a fix for this yesterday.

Dave M changed the var "nextchr" so that it supported a negative
argument beyond its normal duties as holding the codepoint of the
nextchar in the regex matching. This negative argument, -10, is used
to represent the end of string. (All very reasonable.)

Unfortunately this doesn't play entirely nicely with the utf8 code
which for other generally good reasons tends to cast its arguments to
U8 (to deal with "char" inputs). So the code:
                 if (utf8_target
                    && UTF8_IS_ABOVE_LATIN1(nextchr)
sees the -10, casts it to a U8, producing 246 (0xF6), and then considers it
to be the first octet of a (presumably malformed) UTF-8 sequence.

The fix was to change this to
                 if (utf8_target
                    && nextchr >= 0
                    && UTF8_IS_ABOVE_LATIN1(nextchr)
I was going to push this to a branch etc, but I managed to make a big
mess of things yesterday and pushed it to trunk along with a bunch of
other junk stuff which was meant for after the code freeze.

After running out of time cleaning up the stuff that definitely
shouldn't be part of this next release, and being somewhat unwilling to
allow a perl to be released with such an obvious bug and obviously
safe bugfix unapplied, I decided to leave it for today to sort out.

So this issue is fixed and understood, but the exact status of the
patch is a bit up in the air. If sawyer says to revert it then I will,
but I feel this patch is safe enough to be part of the release.

Yves

Note that no test for this problem has been committed.

p5pRT · 2017-04-26T16:52:45Z

From @demerphq

No. Because we are debating if it should be reverted due to code freeze. I
personally think this is sufficiently isolated we should keep it. I am
waiting on sawyer to make a call...

Yves

p5pRT · 2017-04-26T17:22:14Z

From @khwilliamson

On 04/26/2017 10:52 AM, demerphq wrote:

No. Because we are debating if it should be reverted due to code freeze.
I personally think this is sufficiently isolated we should keep it. I am
waiting on sawyer to make a call...

Yves

Right, but if we keep it, shouldn't we also push a test?

Note that this is a regression introduced in 5.25.

p5pRT · 2017-04-26T21:38:01Z

From @demerphq

On 26 Apr 2017 7:21 p.m., "Karl Williamson" <public@khwilliamson.com> wrote:

On 04/26/2017 10:52 AM, demerphq wrote:

No. Because we are debating if it should be reverted due to code freeze.
I personally think this is sufficiently isolated we should keep it. I am
waiting on sawyer to make a call...

Yves

Right, but if we keep it, shouldn't we also push a test?

Note that this is a regression introduced in 5.25

I already messed up the change log with a bunch of reverts. I'd rather not
add another patch if it's just going to be reverted. So if sawyer approves
this change despite the code freeze then we can add a test, otherwise let's
have less to revert.

But one way or another we will need a test for sure. ☺️

Yves

p5pRT · 2017-04-27T11:13:30Z

From @xsawyerx

Just for the official record (and relating to James' email), I'm still
looking into whether to keep it or not.

(I'd rather discuss it with Yves instead of making a single-sided decision.)

Everyone, feel free to provide your opinion. Just keep it polite. :)

On 04/26/2017 06:52 PM, demerphq wrote:

No. Because we are debating if it should be reverted due to code
freeze. I personally think this is sufficiently isolated we should
keep it. I am waiting on sawyer to make a call...

Yves

p5pRT · 2017-04-27T17:39:02Z

From @khwilliamson

On 04/27/2017 05:13 AM, Sawyer X wrote:

Just for the official record (and relating to James' email), I'm still
looking into whether to keep it or not.

(I'd rather discuss it with Yves instead of making a single-sided decision.)

Everyone, feel free to provide your opinion. Just keep it polite. :)

I think the patch should stay. It is extremely low risk, and fixes a
regression and hence would be eligible for 5.26.1 anyway.

But, I want to point out that this highlights a flaw in the perl
implementation. Here, a negative number is being used as a sentinel.
But in most of the implementation, code points can be any UV. If a user
has just the right code point, it would be treated as negative and
conflated with this sentinel. There are other situations where
negatives are similarly used. Because of this, code points outside the
range 0..IV_MAX are deprecated, and are scheduled to be forbidden in 5.28.
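
The conflation can be illustrated with a small sketch. It assumes a 64-bit build, with `int64_t` and `uint64_t` standing in for perl's IV and UV; `SENTINEL_EOS`, `collides_with_sentinel`, and `fits_in_iv` are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define SENTINEL_EOS (-10)  /* the end-of-string sentinel discussed in this thread */

/* int64_t / uint64_t stand in for perl's IV / UV on a 64-bit build. */
static int collides_with_sentinel(uint64_t code_point)
{
    /* Out-of-range unsigned-to-signed conversion is implementation-defined
     * in C; on the usual two's-complement platforms it wraps around. */
    return (int64_t)code_point == SENTINEL_EOS;
}

/* Restricting code points to 0..IV_MAX rules out every such collision. */
static int fits_in_iv(uint64_t code_point)
{
    return code_point <= (uint64_t)INT64_MAX;
}
```

On such a platform the value `UINT64_MAX - 9` reads back as -10 once stored in a signed variable, so it is indistinguishable from the sentinel, and `fits_in_iv` rejects it.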

On 04/26/2017 06:52 PM, demerphq wrote:

No. Because we are debating if it should be reverted due to code
freeze. I personally think this is sufficiently isolated we should
keep it. I am waiting on sawyer to make a call...

Yves

p5pRT · 2017-04-27T18:33:24Z

From @demerphq

On 27 April 2017 at 19:35, Karl Williamson <public@khwilliamson.com> wrote:

On 04/27/2017 05:13 AM, Sawyer X wrote:

Just for the official record (and relating to James' email), I'm still
looking into whether to keep it or not.

(I'd rather discuss it with Yves instead of making a single-sided
decision.)

Everyone, feel free to provide your opinion. Just keep it polite. :)

I think the patch should stay. It is extremely low risk, and fixes a
regression and hence would be eligible for 5.26.1 anyway.

Summarizes my view too. We regressed in 5.25.x on this, and despite
the fact we only noticed during the code freeze, given the fix is as
safe as a fix can be I see no point in rolling out a known regression
in 5.26.x.

If there were any doubt that this patch could have a secondary
side-effect beside fixing this bug I would agree it should wait, and
again, if this had been present in 5.24 I would say it could wait. But
as is, rolling out 5.26.x with this patch reverted means we are
knowingly rolling out 5.26.x with localized regexes broken. Given the
extremely low risk this seems extremely irresponsible.

But, I want to point out that this highlights a flaw in the perl
implementation. Here, a negative number is being used as a sentinel. But in
most of the implementation, code points can be any UV. If a user has just
the right code point it would be treated as negative and conflated
with this sentinel. There are other situations where negatives are
similarly used. Because of this, code points outside the range 0..IV_MAX
are deprecated, and are scheduled to be forbidden in 5.28.

I think I misspoke when I said that nextchr holds a codepoint, I think I
should have said it holds the next *octet*, or in other words my
understanding is nextchr is either negative, in which case it is -10,
otherwise it is restricted to 0..255.

If I am correct in this then I believe the concern you raise here is
not a problem, where it would be if it held a code-point.

One of us should double check, or maybe Dave can speak up on this.

cheers,
Yves

p5pRT · 2017-04-27T19:42:43Z

From saint.snit@gmail.com

If there were any doubt that this patch could have a secondary
side-effect beside fixing this bug I would agree it should wait, and
again, if this had been present in 5.24 I would say it could wait.

Not to advocate for or against the patch's inclusion, but the bug has
existed since at least 5.22.2, which is where I discovered it.

p5pRT · 2017-04-27T19:47:12Z

From @khwilliamson

On 04/27/2017 12:32 PM, demerphq wrote:

But, I want to point out that this highlights a flaw in the perl
implementation. Here, a negative number is being used as a sentinel. But in
most of the implementation, code points can be any UV. If a user has just
the right code point it would be treated as negative and conflated
with this sentinel. There are other situations where negatives are
similarly used. Because of this, code points outside the range 0..IV_MAX
are deprecated, and are scheduled to be forbidden in 5.28.
I think I misspoke when I said that nextchr holds a codepoint, I think I
should have said it holds the next *octet*, or in other words my
understanding is nextchr is either negative, in which case it is -10,
otherwise it is restricted to 0..255.

If I am correct in this then I believe the concern you raise here is
not a problem, where it would be if it held a code-point.

One of us should double check, or maybe Dave can speak up on this.

cheers,
Yves

You're right. I didn't look at the code here before writing. But there
are places in regexec.c where the sentinels can be conflated with an
input code point.

I'm now thinking the patch would be better to be

&& !NEXTCHR_IS_EOS

instead of what you have:

&& nextchr >= 0 /* guard against negative EOS value in nextchr */

The macro is:

#define NEXTCHR_IS_EOS (nextchr < 0)

so it evaluates to the exact same thing as your patch, but uses the
paradigm that was created by Dave for this situation, which is used in
all other similar cases in this file.

p5pRT · 2017-04-27T19:48:26Z

From @khwilliamson

On 04/27/2017 01:42 PM, saint.snit@gmail.com wrote:

If there were any doubt that this patch could have a secondary
side-effect beside fixing this bug I would agree it should wait, and
again, if this had been present in 5.24 I would say it could wait.

Not to advocate for or against the patch's inclusion, but the bug has
existed since at least 5.22.2, which is where I discovered it.

Hmmm. I tried 5.24.1 before I said it was introduced in 5.25.

p5pRT · 2017-04-27T22:27:11Z

From saint.snit@gmail.com

Hmmm. I tried 5.24.1 before I said it was introduced in 5.25.

Upthread, Zefram said it was introduced in v5.21.7-165-g613abc6.
However, this was where the warning was introduced -- the underlying
misinterpretation of EOS that the warning catches may have existed
silently before that. (I suppose it's a philosophical question whether
a coding error that produces no incorrect behavior is still a bug.)

p5pRT · 2017-04-27T22:34:22Z

From zefram@fysh.org

saint.snit@gmail.com wrote:

However, this was where the warning was introduced -- the underlying
misinterpretation of EOS that the warning catches may have existed
silently before that.

No. The misinterpretation of EOS exists only in the code that checks
whether to emit the warning, the code that was introduced in the commit
that I identified.

-zefram

p5pRT · 2017-04-28T06:55:59Z

From @demerphq

If it has been present prior to 5.25 then I will revert. Yves

On 28 Apr 2017 00:34, "Zefram" <zefram@fysh.org> wrote:

saint.snit@gmail.com wrote:

However, this was where the warning was introduced -- the underlying
misinterpretation of EOS that the warning catches may have existed
silently before that.

No. The misinterpretation of EOS exists only in the code that checks
whether to emit the warning, the code that was introduced in the commit
that I identified.

-zefram

p5pRT · 2017-04-28T07:26:24Z

From @demerphq

On 28 April 2017 at 08:55, demerphq <demerphq@gmail.com> wrote:

On 28 Apr 2017 00:34, "Zefram" <zefram@fysh.org> wrote:

saint.snit@gmail.com wrote:

However, this was where the warning was introduced -- the underlying
misinterpretation of EOS that the warning catches may have existed
silently before that.

No. The misinterpretation of EOS exists only in the code that checks
whether to emit the warning, the code that was introduced in the commit
that I identified.

If it has been present prior to 5.25 then I will revert.

Since this was present pre 5.25 I have reverted with
b497237

It still feels kinda wrong but whatever.

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2017-04-28T08:44:39Z

From @iabyn

On Thu, Apr 27, 2017 at 08:32:52PM +0200, demerphq wrote:

I think I misspoke when I said that nextchr holds a codepoint, I think I
should have said it holds the next *octet*, or in other words my
understanding is nextchr is either negative, in which case it is -10,
otherwise it is restricted to 0..255.

Yep, at the top of the main loop is this:

SET_nextchr;
assert(nextchr < 256 && (nextchr >= 0 || nextchr == NEXTCHR_EOS));

On Thu, Apr 27, 2017 at 01:45:46PM -0600, Karl Williamson wrote:

I'm now thinking the patch would be better to be

&& !NEXTCHR_IS_EOS

instead of what you have:

&& nextchr >= 0 /* guard against negative EOS value in nextchr */

The macro is:

#define NEXTCHR_IS_EOS (nextchr < 0)

so it evaluates to the exact same thing as your patch, but uses the paradigm
that was created by Dave for this situation, which is used in all other
similar cases in this file.

+1

--
Please note that ash-trays are provided for the use of smokers,
whereas the floor is provided for the use of all patrons.
-- Bill Royston

p5pRT · 2017-04-28T08:57:50Z

From @demerphq

On 28 April 2017 at 10:44, Dave Mitchell <davem@iabyn.com> wrote:

On Thu, Apr 27, 2017 at 08:32:52PM +0200, demerphq wrote:

I think I misspoke when I said that nextchr holds a codepoint, I think I
should have said it holds the next *octet*, or in other words my
understanding is nextchr is either negative, in which case it is -10,
otherwise it is restricted to 0..255.

Yep, at the top of the main loop is this:
    SET_nextchr;
    assert(nextchr < 256 && (nextchr >= 0 || nextchr == NEXTCHR_EOS));

Thanks, so that is taken care of for now.

On Thu, Apr 27, 2017 at 01:45:46PM -0600, Karl Williamson wrote:

I'm now thinking the patch would be better to be

&& !NEXTCHR_IS_EOS

instead of what you have:

&& nextchr >= 0 /* guard against negative EOS value in nextchr */

The macro is:

#define NEXTCHR_IS_EOS (nextchr < 0)

so it evaluates to the exact same thing as your patch, but uses the paradigm
that was created by Dave for this situation, which is used in all other
similar cases in this file.

+1

For what it's worth I don't mind doing this but I don't like it much. I
would prefer that we create a define like

#define NEXTCHR_IS_OCTET (nextchr>=0)

and then use that. That way we can distinguish between "not end of
string" and "is an octet" in our code, particularly if we add a new
special value for nextchr.

The other approach BTW would be to add a new UTF8 define that does not
cast its argument, and use that instead.

Or do both. :-)

Anyway, since this bug was present prior to 5.25 I have reverted my
patch entirely. We have until 5.26 is out to work out a better patch.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT · 2017-04-28T12:33:03Z

From @xsawyerx

On 04/28/2017 10:57 AM, demerphq wrote:

On 28 April 2017 at 10:44, Dave Mitchell <davem@iabyn.com> wrote:
On Thu, Apr 27, 2017 at 08:32:52PM +0200, demerphq wrote:

I think I misspoke when I said that nextchr holds a codepoint, I think I
should have said it holds the next *octet*, or in other words my
understanding is nextchr is either negative, in which case it is -10,
otherwise it is restricted to 0..255.
Yep, at the top of the main loop is this:
    SET_nextchr;
    assert(nextchr < 256 && (nextchr >= 0 || nextchr == NEXTCHR_EOS));
Thanks, so that is taken care of for now.

On Thu, Apr 27, 2017 at 01:45:46PM -0600, Karl Williamson wrote:

I'm now thinking the patch would be better to be

&& !NEXTCHR_IS_EOS
instead of what you have:

&& nextchr >= 0 /* guard against negative EOS value in nextchr */
The macro is:

#define NEXTCHR_IS_EOS (nextchr < 0)

so it evaluates to the exact same thing as your patch, but uses the paradigm
that was created by Dave for this situation, which is used in all other
similar cases in this file.
+1
For what it's worth I don't mind doing this but I don't like it much. I
would prefer that we create a define like

#define NEXTCHR_IS_OCTET (nextchr>=0)

and then use that. That way we can distinguish between "not end of
string" and "is an octet" in our code, particularly if we add a new
special value for nextchr.

The other approach BTW would be to add a new UTF8 define that does not
cast its argument, and use that instead.

Or do both. :-)

Anyway, since this bug was present prior to 5.25 I have reverted my
patch entirely. We have until 5.26 is out to work out a better patch.

Thanks, everyone! :)

p5pRT · 2018-03-07T16:43:27Z

From @khwilliamson

This was fixed by 2c2da8e but the ticket didn't get closed then
--
Karl Williamson

p5pRT · 2018-03-07T16:43:27Z

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT · 2018-06-23T15:30:00Z

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release yesterday of Perl 5.28.0, this and 185 other issues have been
resolved.

Perl 5.28.0 may be downloaded via:
https://metacpan.org/release/XSAWYERX/perl-5.28.0

If you find that the problem persists, feel free to reopen this ticket.

p5pRT · 2018-06-23T15:30:01Z

@khwilliamson - Status changed from 'pending release' to 'resolved'

p5pRT closed this as completed Jun 23, 2018

p5pRT added the Severity Low label Oct 19, 2019
