Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

erroneous regex warning after utf8 conversion #15958

Closed
p5pRT opened this issue Apr 21, 2017 · 26 comments
Closed

erroneous regex warning after utf8 conversion #15958

p5pRT opened this issue Apr 21, 2017 · 26 comments
Labels

Comments

@p5pRT
Copy link
Collaborator

@p5pRT p5pRT commented Apr 21, 2017

Migrated from rt.perl.org#131190 (status was 'resolved')

Searchable as RT131190$

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 21, 2017

From saint.snit@gmail.com

This is a bug report for perl from saint.snit@​gmail.com,
generated with the help of perlbug 1.40 running under perl 5.22.2.


Perl emits an inapplicable warning for some regular expressions.
Here is a
code block that demonstrates the bug​:

  use experimental 'smartmatch';
  use POSIX 'locale_h';
  use locale '​:ctype';
  setlocale(LC_CTYPE, 'en_US');

  $_ = "x";
  utf8​::upgrade($_);
  /x(y|z)?/;

This produces the output​:

  Wide character (U+FFFD) in pattern match (m//) at - line 8.

even though no such character is ever specified.

The bug only seems to occur with the particular combination of "use"
statements above; for instance, it does not occur if the "smartmatch" line
is omitted, even though the code uses no smart matching.

(For reference, discussion about this bug originated at
http​://perlmonks.com/?node_id=1188156)

Thank you for your time.



Flags​:
  category=core
  severity=low


Site configuration information for perl 5.22.2​:

Configured by Gentoo at Thu Oct 20 22​:32​:43 CDT 2016.

Summary of my perl5 (revision 5 version 22 subversion 2) configuration​:
 
  Platform​:
  osname=linux, osvers=4.4.21-gentoo-evo, archname=i686-linux
  uname='linux evo 4.4.21-gentoo-evo #1 smp sun oct 9 21​:43​:27 cdt
2016 i686 intel(r) pentium(r) 4 cpu 1.70ghz genuineintel gnulinux '
  config_args='-des -Duseshrplib -Darchname=i686-linux
-Dcc=i686-pc-linux-gnu-gcc -Doptimize=-O2 -march=native -pipe
-Dldflags=-Wl,-O1 -Wl,--as-needed -Dprefix=/usr -Dinstallprefix=/usr
-Dsiteprefix=/usr/local -Dvendorprefix=/usr -Dscriptdir=/usr/bin
-Dprivlib=/usr/lib/perl5/5.22.2 -Darchlib=/usr/lib/perl5/5.22.2/i686-linux
-Dsitelib=/usr/local/lib/perl5/5.22.2
-Dsitearch=/usr/local/lib/perl5/5.22.2/i686-linux
-Dvendorlib=/usr/lib/perl5/vendor_perl/5.22.2
-Dvendorarch=/usr/lib/perl5/vendor_perl/5.22.2/i686-linux
-Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3
-Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3
-Dvendorman1dir=/usr/share/man/man1 -Dvendorman3dir=/usr/share/man/man3
-Dman1ext=1 -Dman3ext=3pm -Dlibperl=libperl.so.5.22.2
-Dlocincpth=/usr/include -Dglibpth=/lib /usr/lib -Duselargefiles
-Dd_semctl_semun -Dcf_by=Gentoo -Dmyhostname=localhost
-Dperladmin=root@​localhost -Dinstallusrbinperl=n -Ud_csh -Uusenm -Di_ndbm
-Di_gdbm -Di_db -DDEBUGGING=none -Dinc_version_list=5.22.0/i686-linux
5.22.0 5.22.1/i686-linux 5.22.1 -Dnoextensions=ODBM_File'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='i686-pc-linux-gnu-gcc', ccflags ='-fwrapv -fno-strict-aliasing
-pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  optimize='-O2 -march=native -pipe',
  cppflags='-fwrapv -fno-strict-aliasing -pipe'
  ccversion='', gccversion='4.9.3', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234,
doublekind=3
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12,
longdblkind=3
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries​:
  ld='i686-pc-linux-gnu-gcc', ldflags ='-Wl,-O1 -Wl,--as-needed'
  libpth=/usr/lib/gcc/i686-pc-linux-gnu/4.9.3/include-fixed /usr/lib
/lib/../lib /usr/lib/../lib /lib
  libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
-lgdbm_compat
  perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
  libc=libc-2.22.so, so=so, useshrplib=true, libperl=libperl.so.5.22.2
  gnulibc_version='2.22'
  Dynamic Linking​:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
  cccdlflags='-fPIC', lddlflags='-shared -O2 -march=native -pipe
-Wl,-O1 -Wl,--as-needed'

Locally applied patches​:
  gentoo/hints_hpux - Fix hpux hints
  gentoo/aix_soname - aix gcc detection and shared library soname
support
  gentoo/EUMM-RUNPATH - https://bugs.gentoo.org/105054
cpan/ExtUtils-MakeMaker​: drop $PORTAGE_TMPDIR from LD_RUN_PATH
  gentoo/config_over - Remove -rpath and append LDFLAGS to lddlflags
  gentoo/opensolaris_headers - [PATCH] Add headers for opensolaris
  gentoo/patchlevel - List packaged patches for perl-5.22.2(#1)
in patchlevel.h
  gentoo/cpanplus_definstalldirs - Configure CPANPLUS to use the site
directories by default.
  gentoo/cleanup-paths - [PATCH] Cleanup PATH and shrpenv
  gentoo/enc2xs - Tweak enc2xs to follow symlinks and ignore missing
@​INC directories.
  gentoo/darwin-cc-ld - https://bugs.gentoo.org/297751 [PATCH] darwin​:
Use $CC to link
  gentoo/cpan_definstalldirs - Provide a sensible INSTALLDIRS default
for modules installed from CPAN.
  gentoo/interix - [PATCH] Fix interix hints
  gentoo/create_libperl_soname - https://bugs.gentoo.org/286840 [PATCH]
Set libperl soname
  gentoo/mod_paths - Add /etc/perl to @​INC
  gentoo/EUMM_perllocalpod -
  gentoo/drop_fstack_protector - https://bugs.gentoo.org/348557 [PATCH]
Don't force -fstack-protector on everyone
  gentoo/usr_local - [PATCH] Remove /usr/local paths
  gentoo/D-SHA-CFLAGS - https://bugs.gentoo.org/506818 [PATCH] [PATCH]
Do not set custom CFLAGS in cpan/Digest-SHA
  gentoo/io_socket_ip_tests -
  gentoo/tests.patch -
  debian/cpan-missing-site-dirs - Fix CPAN​::FirstTime defaults with
nonexisting site dirs if a parent is writable
  debian/makemaker-pasthru - Pass LD settings through to subdirectories
  debian/locale-robustness - [perl #124310] Make t/run/locale.t survive
missing locales masked by LC_ALL
  fixes/memoize_storable_nstore - [rt.cpan.org #77790]
Memoize​::Storable​: respect 'nstore' option not respected
  fixes/podman-pipe - Better errors for man pages from standard input
  fixes/respect_umask - Respect umask during installation
  fixes/podman-utc - Make the embedded date from Pod​::Man reproducible
  fixes/podman-utc-docs - Documentation and test suite updates for
UTC fix
  fixes/net_smtp_docs - [rt.cpan.org #36038] Document the Net​::SMTP
'Port' option
  fixes/document_makemaker_ccflags - [rt.cpan.org #68613] Document
that CCFLAGS should include $Config{ccflags}


@​INC for perl 5.22.2​:
  /etc/perl
  /usr/local/lib/perl5/5.22.2/i686-linux
  /usr/local/lib/perl5/5.22.2
  /usr/lib/perl5/vendor_perl/5.22.2/i686-linux
  /usr/lib/perl5/vendor_perl/5.22.2
  /usr/local/lib/perl5
  /usr/lib/perl5/vendor_perl
  /usr/lib/perl5/5.22.2/i686-linux
  /usr/lib/perl5/5.22.2
  .


Environment for perl 5.22.2​:
  HOME=/home/vax
  LANG=C
  LANGUAGE (unset)
  LC_CTYPE=en_US.iso88591
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/home/vax/bin​:/usr/local/bin​:/usr/bin​:/bin​:/opt/bin​:/usr/i686-pc-linux-gnu/gcc-bin/4.9.3​:/sbin​:/usr/sbin​:/usr/games/bin​:./bin
  PERL_BADLANG (unset)
  SHELL=/bin/bash

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 23, 2017

From zefram@fysh.org

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to smartmatch,
but simply for its effect on lexical warning flags. Anything touching
lexical warnings will do, such as the simpler "use warnings". And thus
enabling all warnings produces an additional warning that sheds some
light on the matter​:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at ../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.

-zefram

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 23, 2017

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 24, 2017

From @khwilliamson

On 04/23/2017 01​:33 PM, Zefram wrote​:

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to smartmatch,
but simply for its effect on lexical warning flags. Anything touching
lexical warnings will do, such as the simpler "use warnings". And thus
enabling all warnings produces an additional warning that sheds some
light on the matter​:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at ../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.

Actually, the decode function shouldn't be getting called at all if
there is nothing to decode.

-zefram

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 24, 2017

From @demerphq

On 24 April 2017 at 06​:19, Karl Williamson <public@​khwilliamson.com> wrote​:

On 04/23/2017 01​:33 PM, Zefram wrote​:

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to smartmatch,
but simply for its effect on lexical warning flags. Anything touching
lexical warnings will do, such as the simpler "use warnings". And thus
enabling all warnings produces an additional warning that sheds some
light on the matter​:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at
../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.

Actually, the decode function shouldn't be getting called at all if there is
nothing to decode.

I pushed a fix for this yesterday.

Dave M changed the var "nextchr" so that it supported a negative
argument beyond its normal duties as holding the codepoint of the
nextchar in the regex matching. This negative argument, -10, is used
to represent the end of string. (All very reasonable.)

Unfortunately this doesn't play entirely nicely with the utf8 code
which for other generally good reasons tends to cast its arguments to
U8 (to deal with "char" inputs). So the code​:

  if (utf8_target
  && UTF8_IS_ABOVE_LATIN1(nextchr)

sees the -10, casts it to a U8, producing 255, and then considers it
to be the first octet of a (malformed presumably) utf8 sequence.

The fix was to change this to

  if (utf8_target
  && nextchr >= 0
  && UTF8_IS_ABOVE_LATIN1(nextchr)

I was going to push this to a branch etc, but I managed to make a big
mess of things yesterday and pushed it to trunk along with a bunch of
other junk stuff which was meant for after the code freeze.

After running out of time cleaning up the stuff that definitely
shouldnt be part of this next release, and being somewhat unwilling to
allow a perl to be released with such an obvious bug and obviously
safe bugfix unapplied I decided to leave it for today to sort out.

So this issues is fixed and understood, but the exact status of the
patch is a bit up in the air. If sawyer says to revert it then I will,
but I feel this patch is safe enough to be part of the release.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 26, 2017

From @khwilliamson

On 04/24/2017 01​:20 AM, demerphq wrote​:

On 24 April 2017 at 06​:19, Karl Williamson <public@​khwilliamson.com> wrote​:

On 04/23/2017 01​:33 PM, Zefram wrote​:

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to smartmatch,
but simply for its effect on lexical warning flags. Anything touching
lexical warnings will do, such as the simpler "use warnings". And thus
enabling all warnings produces an additional warning that sheds some
light on the matter​:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at
../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.

Actually, the decode function shouldn't be getting called at all if there is
nothing to decode.

I pushed a fix for this yesterday.

Dave M changed the var "nextchr" so that it supported a negative
argument beyond its normal duties as holding the codepoint of the
nextchar in the regex matching. This negative argument, -10, is used
to represent the end of string. (All very reasonable.)

Unfortunately this doesn't play entirely nicely with the utf8 code
which for other generally good reasons tends to cast its arguments to
U8 (to deal with "char" inputs). So the code​:

                 if \(utf8\_target
                    && UTF8\_IS\_ABOVE\_LATIN1\(nextchr\)

sees the -10, casts it to a U8, producing 255, and then considers it
to be the first octet of a (malformed presumably) utf8 sequence.

The fix was to change this to

                 if \(utf8\_target
                    && nextchr >= 0
                    && UTF8\_IS\_ABOVE\_LATIN1\(nextchr\)

I was going to push this to a branch etc, but I managed to make a big
mess of things yesterday and pushed it to trunk along with a bunch of
other junk stuff which was meant for after the code freeze.

After running out of time cleaning up the stuff that definitely
shouldnt be part of this next release, and being somewhat unwilling to
allow a perl to be released with such an obvious bug and obviously
safe bugfix unapplied I decided to leave it for today to sort out.

So this issues is fixed and understood, but the exact status of the
patch is a bit up in the air. If sawyer says to revert it then I will,
but I feel this patch is safe enough to be part of the release.

Yves

Note that no test for this problem has been committed.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 26, 2017

From @demerphq

No. Because we are debating if it should be reverted due to code freeze. I
personally think this is sufficiently isolated we should keep it. I am
waiting on sawyer to make a call...

Yves

On 26 Apr 2017 6​:17 p.m., "Karl Williamson" <public@​khwilliamson.com> wrote​:

On 04/24/2017 01​:20 AM, demerphq wrote​:

On 24 April 2017 at 06​:19, Karl Williamson <public@​khwilliamson.com>
wrote​:

On 04/23/2017 01​:33 PM, Zefram wrote​:

Bisecting shows that the warning started appearing for that test script
at v5.21.7-165-g613abc6 "Raise warning on multi-byte char in single-byte
locale".

Attempting to minimise the test script, it turns out that the "use
experimental" line is not required for any reason relating to
smartmatch,
but simply for its effect on lexical warning flags. Anything touching
lexical warnings will do, such as the simpler "use warnings". And thus
enabling all warnings produces an additional warning that sheds some
light on the matter​:

$ perl ../rt131190
Malformed UTF-8 character (empty string) in pattern match (m//) at
../rt131190 line 8.
Wide character (U+FFFD) in pattern match (m//) at ../rt131190 line 8.

Looks like the problem is that the check for wide characters should be
passing in the UTF8_ALLOW_EMPTY flag. Without this, when it's at end
of string it perceives a malformed character, for which it warns about
malformation and substitutes in a replacement character, which is wide
and therefore triggers the wide character warning.

Actually, the decode function shouldn't be getting called at all if
there is
nothing to decode.

I pushed a fix for this yesterday.

Dave M changed the var "nextchr" so that it supported a negative
argument beyond its normal duties as holding the codepoint of the
nextchar in the regex matching. This negative argument, -10, is used
to represent the end of string. (All very reasonable.)

Unfortunately this doesn't play entirely nicely with the utf8 code
which for other generally good reasons tends to cast its arguments to
U8 (to deal with "char" inputs). So the code​:

                 if \(utf8\_target
                    && UTF8\_IS\_ABOVE\_LATIN1\(nextchr\)

sees the -10, casts it to a U8, producing 255, and then considers it
to be the first octet of a (malformed presumably) utf8 sequence.

The fix was to change this to

                 if \(utf8\_target
                    && nextchr >= 0
                    && UTF8\_IS\_ABOVE\_LATIN1\(nextchr\)

I was going to push this to a branch etc, but I managed to make a big
mess of things yesterday and pushed it to trunk along with a bunch of
other junk stuff which was meant for after the code freeze.

After running out of time cleaning up the stuff that definitely
shouldnt be part of this next release, and being somewhat unwilling to
allow a perl to be released with such an obvious bug and obviously
safe bugfix unapplied I decided to leave it for today to sort out.

So this issues is fixed and understood, but the exact status of the
patch is a bit up in the air. If sawyer says to revert it then I will,
but I feel this patch is safe enough to be part of the release.

Yves

Note that no test for this problem has been committed.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 26, 2017

From @khwilliamson

On 04/26/2017 10​:52 AM, demerphq wrote​:

No. Because we are debating if it should be reverted due to code freeze.
I personally think this is sufficiently isolated we should keep it. I am
waiting on sawyer to make a call...

Yves

Right, but if we keep it, shouldn't we also push a test?

Note that this is a regression introduced in 5.25.

On 26 Apr 2017 6​:17 p.m., "Karl Williamson" <public@​khwilliamson.com
<mailto​:public@​khwilliamson.com>> wrote​:

On 04/24/2017 01&#8203;:20 AM\, demerphq wrote&#8203;:

    On 24 April 2017 at 06&#8203;:19\, Karl Williamson
    \<public@&#8203;khwilliamson\.com \<mailto&#8203;:public@&#8203;khwilliamson\.com>> wrote&#8203;:

        On 04/23/2017 01&#8203;:33 PM\, Zefram wrote&#8203;:


            Bisecting shows that the warning started appearing for
            that test script
            at v5\.21\.7\-165\-g613abc6 "Raise warning on multi\-byte
            char in single\-byte
            locale"\.

            Attempting to minimise the test script\, it turns out
            that the "use
            experimental" line is not required for any reason
            relating to smartmatch\,
            but simply for its effect on lexical warning flags\.
            Anything touching
            lexical warnings will do\, such as the simpler "use
            warnings"\.  And thus
            enabling all warnings produces an additional warning
            that sheds some
            light on the matter&#8203;:

            $ perl \.\./rt131190
            Malformed UTF\-8 character \(empty string\) in pattern
            match \(m//\) at
            \.\./rt131190 line 8\.
            Wide character \(U\+FFFD\) in pattern match \(m//\) at
            \.\./rt131190 line 8\.

            Looks like the problem is that the check for wide
            characters should be
            passing in the UTF8\_ALLOW\_EMPTY flag\.  Without this\,
            when it's at end
            of string it perceives a malformed character\, for which
            it warns about
            malformation and substitutes in a replacement character\,
            which is wide
            and therefore triggers the wide character warning\.


        Actually\, the decode function shouldn't be getting called at
        all if there is
        nothing to decode\.


    I pushed a fix for this yesterday\.

    Dave M changed the var "nextchr" so that it supported a negative
    argument beyond its normal duties as holding the codepoint of the
    nextchar in the regex matching\. This negative argument\, \-10\, is used
    to represent the end of string\. \(All very reasonable\.\)

    Unfortunately this doesn't play entirely nicely with the utf8 code
    which for other generally good reasons tends to cast its
    arguments to
    U8 \(to deal with "char" inputs\)\. So the code&#8203;:

                         if \(utf8\_target
                            && UTF8\_IS\_ABOVE\_LATIN1\(nextchr\)

    sees the \-10\, casts it to a U8\, producing 255\, and then considers it
    to be the first octet of a \(malformed presumably\) utf8 sequence\.

    The fix was to change this to


                         if \(utf8\_target
                            && nextchr >= 0
                            && UTF8\_IS\_ABOVE\_LATIN1\(nextchr\)

    I was going to push this to a branch etc\, but I managed to make
    a big
    mess of things yesterday and pushed it to trunk along with a
    bunch of
    other junk stuff which was meant for after the code freeze\.

    After running out of time cleaning up the stuff that definitely
    shouldnt be part of this next release\, and being somewhat
    unwilling to
    allow a perl to be released with such an obvious bug and obviously
    safe bugfix unapplied I decided to leave it for today to sort out\.

    So this issues is fixed and understood\, but the exact status of the
    patch is a bit up in the air\. If sawyer says to revert it then I
    will\,
    but I feel this patch is safe enough to be part of the release\.

    Yves


Note that no test for this problem has been committed\.
@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 26, 2017

From @demerphq

On 26 Apr 2017 7​:21 p.m., "Karl Williamson" <public@​khwilliamson.com> wrote​:

On 04/26/2017 10​:52 AM, demerphq wrote​:

No. Because we are debating if it should be reverted due to code freeze.
I personally think this is sufficiently isolated we should keep it. I am
waiting on sawyer to make a call...

Yves

Right, but if we keep it, shouldn't we also push a test?

Note that this is a regression introduced in 5.25

I already messed up the change log with a bunch of reverts. Id rather not
add another patch if it's just going to be reverted. So if sawyer approves
this change despite the code freeze then we can add a test, otherwise let's
have less to revert.

But one way or another we will need a test for sure. ☺️

Yves

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 27, 2017

From @xsawyerx

Just for the official record (and relating to James' email), I'm still
looking into whether to keep it or not.

(I'd rather discuss it with Yves instead of making a single-sided decision.)

Everyone, feel free to provide your opinion. Just keep it polite. :)

On 04/26/2017 06​:52 PM, demerphq wrote​:

No. Because we are debating if it should be reverted due to code
freeze. I personally think this is sufficiently isolated we should
keep it. I am waiting on sawyer to make a call...

Yves

On 26 Apr 2017 6​:17 p.m., "Karl Williamson" <public@​khwilliamson.com
<mailto​:public@​khwilliamson.com>> wrote​:

On 04/24/2017 01&#8203;:20 AM\, demerphq wrote&#8203;:

    On 24 April 2017 at 06&#8203;:19\, Karl Williamson
    \<public@&#8203;khwilliamson\.com \<mailto&#8203;:public@&#8203;khwilliamson\.com>> wrote&#8203;:

        On 04/23/2017 01&#8203;:33 PM\, Zefram wrote&#8203;:


            Bisecting shows that the warning started appearing for
            that test script
            at v5\.21\.7\-165\-g613abc6 "Raise warning on multi\-byte
            char in single\-byte
            locale"\.

            Attempting to minimise the test script\, it turns out
            that the "use
            experimental" line is not required for any reason
            relating to smartmatch\,
            but simply for its effect on lexical warning flags\. 
            Anything touching
            lexical warnings will do\, such as the simpler "use
            warnings"\.  And thus
            enabling all warnings produces an additional warning
            that sheds some
            light on the matter&#8203;:

            $ perl \.\./rt131190
            Malformed UTF\-8 character \(empty string\) in pattern
            match \(m//\) at
            \.\./rt131190 line 8\.
            Wide character \(U\+FFFD\) in pattern match \(m//\) at
            \.\./rt131190 line 8\.

            Looks like the problem is that the check for wide
            characters should be
            passing in the UTF8\_ALLOW\_EMPTY flag\.  Without this\,
            when it's at end
            of string it perceives a malformed character\, for
            which it warns about
            malformation and substitutes in a replacement
            character\, which is wide
            and therefore triggers the wide character warning\.


        Actually\, the decode function shouldn't be getting called
        at all if there is
        nothing to decode\.


    I pushed a fix for this yesterday\.

    Dave M changed the var "nextchr" so that it supported a negative
    argument beyond its normal duties as holding the codepoint of the
    nextchar in the regex matching\. This negative argument\, \-10\,
    is used
    to represent the end of string\. \(All very reasonable\.\)

    Unfortunately this doesn't play entirely nicely with the utf8 code
    which for other generally good reasons tends to cast its
    arguments to
    U8 \(to deal with "char" inputs\)\. So the code&#8203;:

                         if \(utf8\_target
                            && UTF8\_IS\_ABOVE\_LATIN1\(nextchr\)

    sees the \-10\, casts it to a U8\, producing 255\, and then
    considers it
    to be the first octet of a \(malformed presumably\) utf8 sequence\.

    The fix was to change this to


                         if \(utf8\_target
                            && nextchr >= 0
                            && UTF8\_IS\_ABOVE\_LATIN1\(nextchr\)

    I was going to push this to a branch etc\, but I managed to
    make a big
    mess of things yesterday and pushed it to trunk along with a
    bunch of
    other junk stuff which was meant for after the code freeze\.

    After running out of time cleaning up the stuff that definitely
    shouldnt be part of this next release\, and being somewhat
    unwilling to
    allow a perl to be released with such an obvious bug and obviously
    safe bugfix unapplied I decided to leave it for today to sort out\.

    So this issues is fixed and understood\, but the exact status
    of the
    patch is a bit up in the air\. If sawyer says to revert it then
    I will\,
    but I feel this patch is safe enough to be part of the release\.

    Yves


Note that no test for this problem has been committed\.
@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 27, 2017

From @khwilliamson

On 04/27/2017 05​:13 AM, Sawyer X wrote​:

Just for the official record (and relating to James' email), I'm still
looking into whether to keep it or not.

(I'd rather discuss it with Yves instead of making a single-sided decision.)

Everyone, feel free to provide your opinion. Just keep it polite. :)

I think the patch should stay. It is extremely low risk, and fixes a
regression and hence would be eligible for 5.26.1 anyway.

But, I want to point out that this highlights a flaw in the perl
implementation. Here, a negative number is being used as a sentinel.
But in most of the implementation, code points can be any UV. If a user
has just the right code point point it would be treated as negative and
conflated with this sentinel. There are other situations where
negatives are similarly used. Because of this, code points outside the
range 0..IV_MAX are deprecated, and are scheduled to be forbidden in 5.28.

On 04/26/2017 06​:52 PM, demerphq wrote​:

No. Because we are debating if it should be reverted due to code
freeze. I personally think this is sufficiently isolated we should
keep it. I am waiting on sawyer to make a call...

Yves

On 26 Apr 2017 6​:17 p.m., "Karl Williamson" <public@​khwilliamson.com
<mailto​:public@​khwilliamson.com>> wrote​:

On 04/24/2017 01&#8203;:20 AM\, demerphq wrote&#8203;:

    On 24 April 2017 at 06&#8203;:19\, Karl Williamson
    \<public@&#8203;khwilliamson\.com \<mailto&#8203;:public@&#8203;khwilliamson\.com>> wrote&#8203;:

        On 04/23/2017 01&#8203;:33 PM\, Zefram wrote&#8203;:


            Bisecting shows that the warning started appearing for
            that test script
            at v5\.21\.7\-165\-g613abc6 "Raise warning on multi\-byte
            char in single\-byte
            locale"\.

            Attempting to minimise the test script\, it turns out
            that the "use
            experimental" line is not required for any reason
            relating to smartmatch\,
            but simply for its effect on lexical warning flags\.
            Anything touching
            lexical warnings will do\, such as the simpler "use
            warnings"\.  And thus
            enabling all warnings produces an additional warning
            that sheds some
            light on the matter&#8203;:

            $ perl \.\./rt131190
            Malformed UTF\-8 character \(empty string\) in pattern
            match \(m//\) at
            \.\./rt131190 line 8\.
            Wide character \(U\+FFFD\) in pattern match \(m//\) at
            \.\./rt131190 line 8\.

            Looks like the problem is that the check for wide
            characters should be
            passing in the UTF8\_ALLOW\_EMPTY flag\.  Without this\,
            when it's at end
            of string it perceives a malformed character\, for
            which it warns about
            malformation and substitutes in a replacement
            character\, which is wide
            and therefore triggers the wide character warning\.


        Actually\, the decode function shouldn't be getting called
        at all if there is
        nothing to decode\.


    I pushed a fix for this yesterday\.

    Dave M changed the var "nextchr" so that it supported a negative
    argument beyond its normal duties as holding the codepoint of the
    nextchar in the regex matching\. This negative argument\, \-10\,
    is used
    to represent the end of string\. \(All very reasonable\.\)

    Unfortunately this doesn't play entirely nicely with the utf8 code
    which for other generally good reasons tends to cast its
    arguments to
    U8 \(to deal with "char" inputs\)\. So the code&#8203;:

                         if \(utf8\_target
                            && UTF8\_IS\_ABOVE\_LATIN1\(nextchr\)

    sees the \-10\, casts it to a U8\, producing 255\, and then
    considers it
    to be the first octet of a \(malformed presumably\) utf8 sequence\.

    The fix was to change this to


                         if \(utf8\_target
                            && nextchr >= 0
                            && UTF8\_IS\_ABOVE\_LATIN1\(nextchr\)

    I was going to push this to a branch etc\, but I managed to
    make a big
    mess of things yesterday and pushed it to trunk along with a
    bunch of
    other junk stuff which was meant for after the code freeze\.

    After running out of time cleaning up the stuff that definitely
    shouldnt be part of this next release\, and being somewhat
    unwilling to
    allow a perl to be released with such an obvious bug and obviously
    safe bugfix unapplied I decided to leave it for today to sort out\.

    So this issues is fixed and understood\, but the exact status
    of the
    patch is a bit up in the air\. If sawyer says to revert it then
    I will\,
    but I feel this patch is safe enough to be part of the release\.

    Yves


Note that no test for this problem has been committed\.
@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 27, 2017

From @demerphq

On 27 April 2017 at 19​:35, Karl Williamson <public@​khwilliamson.com> wrote​:

On 04/27/2017 05​:13 AM, Sawyer X wrote​:

Just for the official record (and relating to James' email), I'm still
looking into whether to keep it or not.

(I'd rather discuss it with Yves instead of making a single-sided
decision.)

Everyone, feel free to provide your opinion. Just keep it polite. :)

I think the patch should stay. It is extremely low risk, and fixes a
regression and hence would be eligible for 5.26.1 anyway.

Summarizes my view too. We regressed in 5.25.x on this, and despite
the fact we only noticed during the code freeze give the fix is as
safe as a fix can be I see no point in rolling out a known regression
in 5.26.x.

If there were any doubt that this patch could have a secondary
side-effect beside fixing this bug I would agree it should wait, and
again, if this had be present in 5.24 I would say it could wait. But
as is rolling out 5.26.x with this patch reverted means we are
knowingly rolling out 5.26.x with localized regexes broken. Given the
extremely low risk this seems extremely irresponsible.

But, I want to point out that this highlights a flaw in the perl
implementation. Here, a negative number is being used as a sentinel. But in
most of the implementation, code points can be any UV. If a user has just
the right code point point it would be treated as negative and conflated
with this sentinel. There are other situations where negatives are
similarly used. Because of this, code points outside the range 0..IV_MAX
are deprecated, and are scheduled to be forbidden in 5.28.

I think I mispoke when I said that nexchr holds a codepoint, I think I
should have said it holds the next *octet*, or in other words my
understanding is nexchr is either negative, in which case it is -10,
otherwise it is restricted to 0..255.

If I am correct in this then I believe the concern you raise here is
not a problem, where it would be if it held a code-point.

One of us should double check, or maybe Dave can speakup on this.

cheers,
Yves

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 27, 2017

From saint.snit@gmail.com

If there were any doubt that this patch could have a secondary
side-effect beside fixing this bug I would agree it should wait, and
again, if this had be present in 5.24 I would say it could wait.

Not to advocate for or against the patch's inclusion, but the bug has
existed since at least 5.22.2, which is where I discovered it.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 27, 2017

From @khwilliamson

On 04/27/2017 12​:32 PM, demerphq wrote​:

But, I want to point out that this highlights a flaw in the perl
implementation. Here, a negative number is being used as a sentinel. But in
most of the implementation, code points can be any UV. If a user has just
the right code point point it would be treated as negative and conflated
with this sentinel. There are other situations where negatives are
similarly used. Because of this, code points outside the range 0..IV_MAX
are deprecated, and are scheduled to be forbidden in 5.28.
I think I mispoke when I said that nexchr holds a codepoint, I think I
should have said it holds the next *octet*, or in other words my
understanding is nexchr is either negative, in which case it is -10,
otherwise it is restricted to 0..255.

If I am correct in this then I believe the concern you raise here is
not a problem, where it would be if it held a code-point.

One of us should double check, or maybe Dave can speakup on this.

cheers,
Yves

You're right. I didn't look at the code here before writing. But there
are places in regexec.c where the sentinels can be conflated with an
input code point.

I'm now thinking the patch would be better to be

&& !NEXTCHR_IS_EOS

instead of what you have​:

&& nextchr >= 0 /* guard against negative EOS value in nextchr */

The macro is​:

#define NEXTCHR_IS_EOS (nextchr < 0)

so it evaluates to the exact same thing as your patch, but uses the
paradigm that was created by Dave for this situation, which is used in
all other similar cases in this file.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 27, 2017

From @khwilliamson

On 04/27/2017 01​:42 PM, saint.snit@​gmail.com wrote​:

If there were any doubt that this patch could have a secondary
side-effect beside fixing this bug I would agree it should wait, and
again, if this had be present in 5.24 I would say it could wait.

Not to advocate for or against the patch's inclusion, but the bug has
existed since at least 5.22.2, which is where I discovered it.

Hmmm. I tried 5.24.1 before I said it was introduced in 5.25.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 27, 2017

From saint.snit@gmail.com

Hmmm. I tried 5.24.1 before I said it was introduced in 5.25.

Upthread, Zefram said it was introduced in v5.21.7-165-g613abc6.
However, this was where the warning was introduced -- the underlying
misinterpretation of EOS that the warning catches may have existed
silently before that. (I suppose it's a philosophical question whether
a coding error that produces no incorrect behavior is still a bug.)

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 27, 2017

From zefram@fysh.org

saint.snit@​gmail.com wrote​:

However, this was where the warning was introduced -- the underlying
misinterpretation of EOS that the warning catches may have existed
silently before that.

No. The misinterpretation of EOS exists only in the code that checks
whether to emit the warning, the code that was introduced in the commit
that I identified.

-zefram

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 28, 2017

From @demerphq

If it has been present prior to 5.25 then I will revert. Yves

On 28 Apr 2017 00​:34, "Zefram" <zefram@​fysh.org> wrote​:

saint.snit@​gmail.com wrote​:

However, this was where the warning was introduced -- the underlying
misinterpretation of EOS that the warning catches may have existed
silently before that.

No. The misinterpretation of EOS exists only in the code that checks
whether to emit the warning, the code that was introduced in the commit
that I identified.

-zefram

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 28, 2017

From @demerphq

On 28 April 2017 at 08​:55, demerphq <demerphq@​gmail.com> wrote​:

On 28 Apr 2017 00​:34, "Zefram" <zefram@​fysh.org> wrote​:

saint.snit@​gmail.com wrote​:

However, this was where the warning was introduced -- the underlying
misinterpretation of EOS that the warning catches may have existed
silently before that.

No. The misinterpretation of EOS exists only in the code that checks
whether to emit the warning, the code that was introduced in the commit
that I identified.

If it has been present prior to 5.25 then I will revert.

Since this was present pre 5.25 I have reverted with
b497237

It still feels kinda wrong but whatever.

Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 28, 2017

From @iabyn

On Thu, Apr 27, 2017 at 08​:32​:52PM +0200, demerphq wrote​:

I think I mispoke when I said that nexchr holds a codepoint, I think I
should have said it holds the next *octet*, or in other words my
understanding is nexchr is either negative, in which case it is -10,
otherwise it is restricted to 0..255.

Yep, at the top of the main loop is this​:

  SET_nextchr;
  assert(nextchr < 256 && (nextchr >= 0 || nextchr == NEXTCHR_EOS));

On Thu, Apr 27, 2017 at 01​:45​:46PM -0600, Karl Williamson wrote​:

I'm now thinking the patch would be better to be

&& !NEXTCHR_IS_EOS

instead of what you have​:

&& nextchr >= 0 /* guard against negative EOS value in nextchr */

The macro is​:

#define NEXTCHR_IS_EOS (nextchr < 0)

so it evaluates to the exact same thing as your patch, but uses the paradigm
that was created by Dave for this situation, which is used in all other
similar cases in this file.

+1

--
Please note that ash-trays are provided for the use of smokers,
whereas the floor is provided for the use of all patrons.
  -- Bill Royston

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 28, 2017

From @demerphq

On 28 April 2017 at 10​:44, Dave Mitchell <davem@​iabyn.com> wrote​:

On Thu, Apr 27, 2017 at 08​:32​:52PM +0200, demerphq wrote​:

I think I mispoke when I said that nexchr holds a codepoint, I think I
should have said it holds the next *octet*, or in other words my
understanding is nexchr is either negative, in which case it is -10,
otherwise it is restricted to 0..255.

Yep, at the top of the main loop is this​:

    SET\_nextchr;
    assert\(nextchr \< 256 && \(nextchr >= 0 || nextchr == NEXTCHR\_EOS\)\);

Thanks, so that is taken care of for now.

On Thu, Apr 27, 2017 at 01​:45​:46PM -0600, Karl Williamson wrote​:

I'm now thinking the patch would be better to be

&& !NEXTCHR_IS_EOS

instead of what you have​:

&& nextchr >= 0 /* guard against negative EOS value in nextchr */

The macro is​:

#define NEXTCHR_IS_EOS (nextchr < 0)

so it evaluates to the exact same thing as your patch, but uses the paradigm
that was created by Dave for this situation, which is used in all other
similar cases in this file.

+1

For what its worth I dont mind doing this but I don't like it much. I
would prefer that we create a define like

#define NEXTCHR_IS_OCTET (nextchr>=0)

and then use that. That way we can distinguish between "not end of
string" and "is a octet" in our code, particularly if we add a new
special value for nextchr.

The other approach BTW would be to add a new UTF8 define that does not
cast its argument, and use that instead.

Or do both. :-)

Anyway, since this bug was present prior to 5.25 I have reverted my
patch entirely. We have until 5.26 is out to work out a better patch.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 28, 2017

From @xsawyerx

On 04/28/2017 10​:57 AM, demerphq wrote​:

On 28 April 2017 at 10​:44, Dave Mitchell <davem@​iabyn.com> wrote​:

On Thu, Apr 27, 2017 at 08​:32​:52PM +0200, demerphq wrote​:

I think I mispoke when I said that nexchr holds a codepoint, I think I
should have said it holds the next *octet*, or in other words my
understanding is nexchr is either negative, in which case it is -10,
otherwise it is restricted to 0..255.
Yep, at the top of the main loop is this​:

    SET\_nextchr;
    assert\(nextchr \< 256 && \(nextchr >= 0 || nextchr == NEXTCHR\_EOS\)\);

Thanks, so that is taken care of for now.

On Thu, Apr 27, 2017 at 01​:45​:46PM -0600, Karl Williamson wrote​:

I'm now thinking the patch would be better to be

&& !NEXTCHR_IS_EOS
instead of what you have​:

&& nextchr >= 0 /* guard against negative EOS value in nextchr */
The macro is​:

#define NEXTCHR_IS_EOS (nextchr < 0)

so it evaluates to the exact same thing as your patch, but uses the paradigm
that was created by Dave for this situation, which is used in all other
similar cases in this file.
+1
For what its worth I dont mind doing this but I don't like it much. I
would prefer that we create a define like

#define NEXTCHR_IS_OCTET (nextchr>=0)

and then use that. That way we can distinguish between "not end of
string" and "is a octet" in our code, particularly if we add a new
special value for nextchr.

The other approach BTW would be to add a new UTF8 define that does not
cast its argument, and use that instead.

Or do both. :-)

Anyway, since this bug was present prior to 5.25 I have reverted my
patch entirely. We have until 5.26 is out to work out a better patch.

Thanks, everyone! :)

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Mar 7, 2018

From @khwilliamson

This was fixed by 2c2da8e but the ticket didn't get closed then
--
Karl Williamson

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Mar 7, 2018

@khwilliamson - Status changed from 'open' to 'pending release'

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 23, 2018

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release yesterday of Perl 5.28.0, this and 185 other issues have been
resolved.

Perl 5.28.0 may be downloaded via​:
https://metacpan.org/release/XSAWYERX/perl-5.28.0

If you find that the problem persists, feel free to reopen this ticket.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 23, 2018

@khwilliamson - Status changed from 'pending release' to 'resolved'

@p5pRT p5pRT closed this Jun 23, 2018
@p5pRT p5pRT added the Severity Low label Oct 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.