Compile-time warning with UTF8 variable in array index #14601

Closed
p5pRT opened this issue Mar 19, 2015 · 18 comments


@p5pRT p5pRT commented Mar 19, 2015

Migrated from rt.perl.org#124113 (status was 'resolved')

Searchable as RT124113$


@p5pRT p5pRT commented Mar 19, 2015

From alex@chmrr.net

Created by alex@chmrr.net

Perl 5.18.0 and above cause compile-time warnings with​:

  $array[ $𝛃 + 0 ];

..but not​:

  $array[ 0 + $𝛃 ];

I've attached a test.pl file containing the above, in case the U+1D6C3
gets corrupted in transit. The warnings are:

  Passing malformed UTF-8 to "XPosixWord" is deprecated at test.pl line 13.
  Malformed UTF-8 character (unexpected continuation byte 0x9d, with no preceding start byte) at test.pl line 13.
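
To illustrate where the reported byte 0x9d comes from, here is a minimal C sketch. It is not code from perl itself; encode_utf8 and is_continuation_byte are hypothetical helper names introduced only for this example. It encodes a code point as UTF-8 and tests whether a byte is a continuation byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of perl: encode one Unicode code point as
 * UTF-8 into out[] and return the number of bytes written (1 to 4). */
static size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFF);            /* larger values are not Unicode */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Continuation bytes have the bit pattern 10xxxxxx. */
static int is_continuation_byte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

For U+1D6C3 this produces the four bytes f0 9d 9b 83. A scanner that moves forward a single byte after the lead byte f0 lands on 9d, which is the continuation byte named in the second warning.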

Bisect points to​:

commit 2812354
Author​: Karl Williamson <public@​khwilliamson.com>
Date​: Sun Dec 23 10​:03​:16 2012 -0700

  Deprecate calling isFOO_utf8() with malformed
 
  handy.h has character classification macros to determine if a UTF-8
  encoded character is of a given type FOO, such as isALPHA_utf8(), etc.
  Code that calls these should have first made sure that the parameter is
  legal UTF-8. Prior to this patch, false was silently returned for all
  illegal UTF-8. Now, in most instances, a deprecation warning is raised.
  This is to catch bugs, and prepare for eventual elimination of this
  check, which fails to catch read-off-end-of-buffer malformations anyway.
  (One idea would be to leave the check in for DEBUGGING builds.)
 
  The cases where no deprecation warning is raised as a result of this
  commit is for the classes where the character does not have to be
  converted to a code point for its inclusion to be determined. For
  example, if malformed UTF-8 is checked to see if it is ASCII, we only
  need to check that it is one of the 128 ASCII characters. If it isn't,
  we don't bother to see if it is malformed or not. There are other
  cases, as well, such as with isSPACE(), where we check if the UTF-8 is
  one of a very finite set, without checking for malformedness.
 
  This commit causes a number of apparent bugs to be shown by the Perl
  test suite. These do not cause actual failures.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.20.2:

Configured by chmrr at Wed Mar 18 20:04:27 EDT 2015.

Summary of my perl5 (revision 5 version 20 subversion 2) configuration:
   
  Platform:
    osname=linux, osvers=3.13.0-44-generic, archname=x86_64-linux
    uname='linux mycon.chmrr.net 3.13.0-44-generic #73-ubuntu smp tue dec 16 00:22:43 utc 2014 x86_64 x86_64 x86_64 gnulinux '
    config_args='-de -Dprefix=/opt/perlbrew/perls/perl-5.20.2 -Aeval:scriptdir=/opt/perlbrew/perls/perl-5.20.2/bin'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.8.2', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.8/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.19'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'



@INC for perl 5.20.2:
    /opt/perlbrew/perls/perl-5.20.2/lib/site_perl/5.20.2/x86_64-linux
    /opt/perlbrew/perls/perl-5.20.2/lib/site_perl/5.20.2
    /opt/perlbrew/perls/perl-5.20.2/lib/5.20.2/x86_64-linux
    /opt/perlbrew/perls/perl-5.20.2/lib/5.20.2
    .


Environment for perl 5.20.2:
    HOME=/home/chmrr
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/opt/perlbrew/bin:/opt/perlbrew/perls/perl-5.20.2/bin:/home/chmrr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
    PERLBREW_BASHRC_VERSION=0.67
    PERLBREW_HOME=/home/chmrr/.perlbrew
    PERLBREW_MANPATH=/opt/perlbrew/perls/perl-5.20.2/man
    PERLBREW_PATH=/opt/perlbrew/bin:/opt/perlbrew/perls/perl-5.20.2/bin
    PERLBREW_PERL=perl-5.20.2
    PERLBREW_ROOT=/opt/perlbrew
    PERLBREW_VERSION=0.66
    PERL_BADLANG (unset)
    SHELL=/bin/bash


@p5pRT p5pRT commented Mar 23, 2015

From alex@chmrr.net

On Wed, 18 Mar 2015 17​:12​:38 -0700 Alex Vandiver (via RT)

Perl 5.18.0 and above cause compile-time warnings with​:

$array[ $𝛃 + 0 ];

..but not:

$array[ 0 + $𝛃 ];

Patches for this, as well as 2.5 related problems, attached.
- Alex


@p5pRT p5pRT commented Mar 23, 2015

From alex@chmrr.net

0001-perl-124113-Make-check-for-multi-dimensional-arrays-.patch
From dd02e43995a74db7c64092864bfa894ea3b2a576 Mon Sep 17 00:00:00 2001
From: Alex Vandiver <alex@chmrr.net>
Date: Sun, 22 Mar 2015 22:39:23 -0400
Subject: [PATCH 1/4] [perl #124113] Make check for multi-dimensional arrays be
 UTF8-aware

During parsing, toke.c checks if the user is attempting to provide multiple
indexes to an array index:

    $a[ $foo, $bar ];

However, while checking for word characters in variable names is aware
of multi-byte characters if "use utf8" is enabled, the loop is only
advanced one byte at a time, not one character at a time.  As such,
multibyte variables in array indexes incorrectly yield warnings:

    Passing malformed UTF-8 to "XPosixWord" is deprecated
    Malformed UTF-8 character (unexpected continuation byte 0x9d, with
      no preceding start byte)

Switch the loop to advance character-by-character if UTF-8 semantics are
in use.
---
 t/lib/warnings/toke | 10 ++++++++++
 toke.c              |  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke
index 5d31104..018f188 100644
--- a/t/lib/warnings/toke
+++ b/t/lib/warnings/toke
@@ -1521,3 +1521,13 @@ Use of literal control characters in variable names is deprecated at (eval 2) li
 -a;
 ;-a;
 EXPECT
+########
+# toke.c
+# [perl #124113] Compile-time warning with UTF8 variable in array index
+use warnings;
+use utf8;
+my $𝛃 = 0;
+my @array = (0);
+my $v = $array[ 0 + $𝛃 ];
+   $v = $array[ $𝛃 + 0 ];
+EXPECT
diff --git a/toke.c b/toke.c
index 610db62..50eb89b 100644
--- a/toke.c
+++ b/toke.c
@@ -6049,7 +6049,7 @@ Perl_yylex(pTHX)
 			char *t = s+1;
 
 			while (isSPACE(*t) || isWORDCHAR_lazy_if(t,UTF) || *t == '$')
-			    t++;
+			    t += UTF ? UTF8SKIP(t) : 1;
 			if (*t++ == ',') {
 			    PL_bufptr = skipspace(PL_bufptr); /* XXX can realloc */
 			    while (t < PL_bufend && *t != ']')
-- 
2.3.3
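
The difference between the two ways of advancing can be sketched outside of perl. The following is a simplified illustration under stated assumptions, not the toke.c code: utf8_skip is a stand-in for perl's UTF8SKIP macro, misaligned_starts is a hypothetical helper, and the scan walks a whole buffer instead of stopping at the first non-word character as the real loop does.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for UTF8SKIP: the length of a UTF-8 sequence as
 * implied by its lead byte. Continuation or invalid lead bytes advance by
 * one byte so that a scan always makes progress. */
static size_t utf8_skip(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Hypothetical helper: count how often a scan would hand a classification
 * macro a pointer into the middle of a character. by_char selects the
 * patched stepping (one character at a time) over the old one-byte step. */
static size_t misaligned_starts(const unsigned char *s, size_t len, int by_char)
{
    size_t i = 0, bad = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        if ((s[i] & 0xC0) == 0x80)
            bad++;                      /* landed on a continuation byte */
        i += by_char ? utf8_skip(s[i]) : 1;
    }
    return bad;
}
```

On the bytes of "$𝛃 + 0", stepping by byte lands on three continuation bytes (9d, 9b, 83), each of which would trigger the warnings quoted in the commit message, while stepping by character lands on none.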


@p5pRT p5pRT commented Mar 23, 2015

From alex@chmrr.net

0002-Allow-unquoted-UTF-8-HERE-document-terminators.patch
From 21c279c0ea6a8425e3121de8670e10874c1cf5b8 Mon Sep 17 00:00:00 2001
From: Alex Vandiver <alex@chmrr.net>
Date: Sun, 22 Mar 2015 22:45:54 -0400
Subject: [PATCH 2/4] Allow unquoted UTF-8 HERE-document terminators
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When not explicitly quoted, tokenization of the HERE-document terminator
dealt improperly with multi-byte characters, advancing one byte at a
time instead of one character at a time.  This led to
incomprehensible-to-the-user errors of the form:

    Passing malformed UTF-8 to "XPosixWord" is deprecated
    Malformed UTF-8 character (unexpected continuation byte 0xa7, with
      no preceding start byte)
    Can't find string terminator "EnFra�" anywhere before EOF

If enclosed in single or double quotes, parsing was correctly effected,
as delimcpy advances byte-by-byte, but looks only for the single-byte
ending character.

When doing a \w+ match looking for the end of the word, advance
character-by-character instead of byte-by-byte, ensuring that the size
does not extend past the available size in PL_tokenbuf.
---
 t/lib/warnings/toke | 11 +++++++++++
 toke.c              | 10 +++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke
index 018f188..b1d5347 100644
--- a/t/lib/warnings/toke
+++ b/t/lib/warnings/toke
@@ -1531,3 +1531,14 @@ my @array = (0);
 my $v = $array[ 0 + $𝛃 ];
    $v = $array[ $𝛃 + 0 ];
 EXPECT
+########
+# toke.c
+# Allow Unicode here doc boundaries
+use warnings;
+use utf8;
+my $v = <<EnFraçais;
+Comme ca!
+EnFraçais
+print $v;
+EXPECT
+Comme ca!
diff --git a/toke.c b/toke.c
index 50eb89b..d81689f 100644
--- a/toke.c
+++ b/toke.c
@@ -9210,10 +9210,14 @@ S_scan_heredoc(pTHX_ char *s)
 	    term = '"';
 	if (!isWORDCHAR_lazy_if(s,UTF))
 	    deprecate("bare << to mean <<\"\"");
-	for (; isWORDCHAR_lazy_if(s,UTF); s++) {
-	    if (d < e)
-		*d++ = *s;
+	peek = s;
+	while (isWORDCHAR_lazy_if(peek,UTF)) {
+	    peek += UTF ? UTF8SKIP(peek) : 1;
 	}
+	len = (peek - s >= e - d) ? (e - d) : (peek - s);
+	Copy(s, d, len, char);
+	s += len;
+	d += len;
     }
     if (d >= PL_tokenbuf + sizeof PL_tokenbuf - 1)
 	Perl_croak(aTHX_ "Delimiter for here document is too long");
-- 
2.3.3


@p5pRT p5pRT commented Mar 23, 2015

From alex@chmrr.net

0003-Fix-.without-parentheses-is-ambuguous-warning-for-UT.patch
From 93439761a29d0ae39c6716814249e2c075562522 Mon Sep 17 00:00:00 2001
From: Alex Vandiver <alex@chmrr.net>
Date: Sun, 22 Mar 2015 23:08:24 -0400
Subject: [PATCH 3/4] Fix "...without parentheses is ambuguous" warning for
 UTF-8 function names
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

While isWORDCHAR_lazy_if is UTF-8 aware, checking advanced byte-by-byte.
This led to errors of the form:

   Passing malformed UTF-8 to "XPosixWord" is deprecated
   Malformed UTF-8 character (unexpected continuation byte 0x9d, with
     no preceding start byte)
   Warning: Use of "�" without parentheses is ambiguous

Use UTF8SKIP to advance character-by-character, not byte-by-byte.
---
 t/lib/warnings/toke | 10 ++++++++++
 toke.c              |  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke
index b1d5347..6cbce2e 100644
--- a/t/lib/warnings/toke
+++ b/t/lib/warnings/toke
@@ -1542,3 +1542,13 @@ EnFraçais
 print $v;
 EXPECT
 Comme ca!
+########
+# toke.c
+# Fix 'Use of "..." without parentheses is ambiguous' warning for
+# Unicode function names
+use utf8;
+use warnings;
+sub 𝛃(;$) { return 0; }
+my $v = 𝛃 - 5;
+EXPECT
+Warning: Use of "𝛃" without parentheses is ambiguous at - line 7.
diff --git a/toke.c b/toke.c
index d81689f..338e1fd 100644
--- a/toke.c
+++ b/toke.c
@@ -1841,7 +1841,7 @@ S_check_uni(pTHX)
 	PL_last_uni++;
     s = PL_last_uni;
     while (isWORDCHAR_lazy_if(s,UTF) || *s == '-')
-	s++;
+	s += UTF ? UTF8SKIP(s) : 1;
     if ((t = strchr(s, '(')) && t < PL_bufptr)
 	return;
 
-- 
2.3.3


@p5pRT p5pRT commented Mar 23, 2015

From alex@chmrr.net

0004-Adjust-callsites-that-use-UTF8SKIP-without-checking-.patch
From 932f135e5936c3bcc57382dc44dbfcd7225600df Mon Sep 17 00:00:00 2001
From: Alex Vandiver <alex@chmrr.net>
Date: Sun, 22 Mar 2015 23:44:11 -0400
Subject: [PATCH 4/4] Adjust callsites that use UTF8SKIP without checking UTF

Assuming UTF-8 semantics and advancing character-by-character when 'use
utf8' is not enabled is not as problematic as the inverse.  However,
properly UTF8SKIP should only be used when UTF8 semantics are explicitly
asked for.

Change the three occurrences of UTF8SKIP that are not protected by UTF
checks.
---
 toke.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/toke.c b/toke.c
index 338e1fd..317dbcc 100644
--- a/toke.c
+++ b/toke.c
@@ -5648,12 +5648,12 @@ Perl_yylex(pTHX)
 		    else
 			/* skip plain q word */
 			while (t < PL_bufend && isWORDCHAR_lazy_if(t,UTF))
-			     t += UTF8SKIP(t);
+			    t += UTF ? UTF8SKIP(t) : 1;
 		}
 		else if (isWORDCHAR_lazy_if(t,UTF)) {
-		    t += UTF8SKIP(t);
+		    t += UTF ? UTF8SKIP(t) : 1;
 		    while (t < PL_bufend && isWORDCHAR_lazy_if(t,UTF))
-			 t += UTF8SKIP(t);
+			t += UTF ? UTF8SKIP(t) : 1;
 		}
 		while (t < PL_bufend && isSPACE(*t))
 		    t++;
-- 
2.3.3


@p5pRT p5pRT commented Mar 27, 2015

From @cpansprout

On Sun Mar 22 21​:12​:43 2015, alex@​chmrr.net wrote​:

On Wed, 18 Mar 2015 17​:12​:38 -0700 Alex Vandiver (via RT)

Perl 5.18.0 and above cause compile-time warnings with​:

$array[ $𝛃 + 0 ];

..but not:

$array[ 0 + $𝛃 ];

Patches for this, as well as 2.5 related problems, attached.
- Alex

Thank you. I have applied the first three patches​:

$ git log --oneline -3
8ce2ba8 Fix "...without parentheses is ambuguous" warning for UTF-8 function nam
6e59c86 Allow unquoted UTF-8 HERE-document terminators
b3089e9 [perl #124113] Make check for multi-dimensional arrays be UTF8-aware

I don’t like the fact that the fourth one lacks tests, even though I believe it corrects the behaviour. Also, what it addresses is not a regression of any sort, so it needs to wait until after 5.22 since we are in code freeze.

--

Father Chrysostomos


@p5pRT p5pRT commented Mar 27, 2015

The RT System itself - Status changed from 'new' to 'open'


@p5pRT p5pRT commented Mar 30, 2015

From @iabyn

On Fri, Mar 27, 2015 at 01​:17​:58PM -0700, Father Chrysostomos via RT wrote​:

Thank you. I have applied the first three patches​:

$ git log --oneline -3
8ce2ba8 Fix "...without parentheses is ambuguous" warning for UTF-8 function nam

This one is intermittently failing smokes.

The test is​:

  # toke.c
  # Fix 'Use of "..." without parentheses is ambiguous' warning for
  # Unicode function names
  use utf8;
  use warnings;
  sub 𝛃(;$) { return 0; }
  my $v = 𝛃 - 5;

Run by hand, this (correctly) gives me​:

  Warning​: Use of "𝛃" without parentheses is ambiguous at /tmp/p line 7.

'od' shows that the bytes that make up the beta in the src are​:

  f0 9d 9b 83 (i.e. codepoint \x{1d6c3})

and that the bytes output for the beta in the warning message when run
by hand are​:

  f0 9d 9b 83

According to George Greer's smoke log,

  http://m-l.org/~perl/smoke/perl/linux/blead_g++/log38f18a308b948c6eaf187519a16d060e1ec7cc20.log.gz

The output is​:

  EXPECTED​:
  Warning​: Use of "AAAA" without parentheses is ambiguous at - line 7.
  GOT​:
  Warning​: Use of "BBBB" without parentheses is ambiguous at - line 7.

where the bytes that make up AAAA and BBBB are​:

  c3 b0 c2 9d c2 9b c2 83

  c3 83 c2 b0 c3 82 c2 9d c3 82 c2 9b c3 82 c2 83

AAAA is the original bytes double-encoded, while BBBB is triple-encoded.
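
The byte sequences above can be reproduced with a short C sketch. It assumes that each extra level of encoding comes from treating every byte as a Latin-1 character and encoding it as UTF-8 again; latin1_to_utf8 and same_bytes are hypothetical helper names, not anything from perl or the smoker.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: treat each input byte as a Latin-1 code point and
 * encode it as UTF-8. Returns the number of bytes written; out must have
 * room for 2 * len bytes. */
static size_t latin1_to_utf8(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];           /* ASCII is unchanged */
        } else {
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return n;
}

/* Byte-wise comparison of two buffers of equal length. */
static int same_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
    return memcmp(a, b, n) == 0;
}
```

Applying latin1_to_utf8 once to f0 9d 9b 83 gives the eight bytes listed for AAAA, and applying it again to that result gives the sixteen bytes listed for BBBB.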

I guess that one extra level of encoding is caused by the smoker code when
generating smoke logs, but I don't see why the 'got' message should have
an extra level of encoding on top of that, and why it's intermittent
(sometimes a mismatch between TEST and harness, and for some
configurations not at all), and why it doesn't fail for me.

The smokes seem to only fail for the permutations with LC_ALL=en_US.utf8.

--
Nothing ventured, nothing lost.


@p5pRT p5pRT commented Mar 30, 2015

From alex@chmrr.net

On Mon, 30 Mar 2015 12​:00​:26 +0100 Dave Mitchell <davem@​iabyn.com>
wrote​:

On Fri, Mar 27, 2015 at 01​:17​:58PM -0700, Father Chrysostomos via RT wrote​:

Thank you. I have applied the first three patches​:

$ git log --oneline -3
8ce2ba8 Fix "...without parentheses is ambuguous" warning for UTF-8 function nam

This one is intermittently failing smokes.
[snip]

Thanks for the note -- I'll take a closer look tonight.

- Alex


@p5pRT p5pRT commented Mar 30, 2015

From @nwc10

On Mon, Mar 30, 2015 at 12​:00​:26PM +0100, Dave Mitchell wrote​:

I guess that one extra level of encoding is caused by the smoker code when
generating smoke logs, but I don't see why the 'got' message should have
an extra level of encoding on top of that, and why it's intermittent
(sometimes a mismatch between TEST and harness, and for some
configurations not at all), and why it doesn't fail for me.

The smokes seem to only fail for the permutations with LC_ALL=en_US.utf8.

I can consistently see the failures under t/harness on dromedary with
LC_ALL=en_US.utf8

I ran a bisect as​:

LC_ALL=en_US.UTF-8 PERL_UNICODE="" perl Porting/bisect.pl --start v5.21.10 --target lib/warnings.t

and it reports that the errors start at this commit​:

commit 8ce2ba8
Author​: Alex Vandiver <alex@​chmrr.net>
Date​: Sun Mar 22 23​:08​:24 2015 -0400

  Fix "...without parentheses is ambuguous" warning for UTF-8 function names
 
  While isWORDCHAR_lazy_if is UTF-8 aware, checking advanced byte-by-byte.
  This lead to errors of the form​:
 
  Passing malformed UTF-8 to "XPosixWord" is deprecated
  Malformed UTF-8 character (unexpected continuation byte 0x9d, with
  no preceding start byte)
  Warning​: Use of "�" without parentheses is ambiguous
 
  Use UTF8SKIP to advance character-by-character, not byte-by-byte.

(and by implication they are not a side effect of a later commit)

I'm not in a position to investigate further as to why, let alone provide a
fix.

Nicholas Clark


@p5pRT p5pRT commented Mar 31, 2015

From alex@chmrr.net

On Mon, 30 Mar 2015 19​:38​:24 +0100 Nicholas Clark <nick@​ccl4.org> wrote​:

I can consistently see the failures under t/harness on dromedary with
LC_ALL=en_US.utf8

I ran a bisect as​:

LC_ALL=en_US.UTF-8 PERL_UNICODE="" perl Porting/bisect.pl --start v5.21.10 --target lib/warnings.t

The test failure requires PERL_UNICODE="", and uncovers a warning which
was missing a UTF8fARG(). A little more poking around uncovered a
couple more as well; patch attached. The fact that the wide character
is reported "in print" and not "in warn" is likely its own bug.
- Alex


@p5pRT p5pRT commented Mar 31, 2015

From alex@chmrr.net

0001-toke.c-UTF-8-aware-warning-cleanups.patch
From 3b98ad2da63a7f9c25d3b5b1063d3787e0b6790a Mon Sep 17 00:00:00 2001
From: Alex Vandiver <alex@chmrr.net>
Date: Tue, 31 Mar 2015 03:46:41 -0400
Subject: [PATCH] toke.c: UTF-8 aware warning cleanups

---
 t/lib/warnings/toke |  6 ++++--
 toke.c              | 13 +++++++------
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke
index 6cbce2e..dab8451 100644
--- a/t/lib/warnings/toke
+++ b/t/lib/warnings/toke
@@ -1545,10 +1545,12 @@ Comme ca!
 ########
 # toke.c
 # Fix 'Use of "..." without parentheses is ambiguous' warning for
-# Unicode function names
+# Unicode function names.  If not under PERL_UNICODE, this will generate
+# a "Wide character" warning
 use utf8;
 use warnings;
 sub 𝛃(;$) { return 0; }
 my $v = 𝛃 - 5;
 EXPECT
-Warning: Use of "𝛃" without parentheses is ambiguous at - line 7.
+OPTION regex
+(Wide character.*\n)?Warning: Use of "𝛃" without parentheses is ambiguous
diff --git a/toke.c b/toke.c
index a115c28..ed7967c 100644
--- a/toke.c
+++ b/toke.c
@@ -1846,8 +1846,8 @@ S_check_uni(pTHX)
 	return;
 
     Perl_ck_warner_d(aTHX_ packWARN(WARN_AMBIGUOUS),
-		     "Warning: Use of \"%.*s\" without parentheses is ambiguous",
-		     (int)(s - PL_last_uni), PL_last_uni);
+		     "Warning: Use of \"%"UTF8f"\" without parentheses is ambiguous",
+		     UTF8fARG(UTF, (int)(s - PL_last_uni), PL_last_uni));
 }
 
 /*
@@ -2529,9 +2529,10 @@ S_get_and_check_backslash_N_name(pTHX_ const char* s, const char* const e)
         /* We deliberately don't try to print the malformed character, which
          * might not print very well; it also may be just the first of many
          * malformations, so don't print what comes after it */
-        yyerror(Perl_form(aTHX_
+        yyerror_pv(Perl_form(aTHX_
             "Malformed UTF-8 character immediately after '%.*s'",
-            (int) (first_bad_char_loc - (U8 *) backslash_ptr), backslash_ptr));
+            (int) (first_bad_char_loc - (U8 *) backslash_ptr), backslash_ptr),
+                   SVf_UTF8);
 	return NULL;
     }
 
@@ -6055,8 +6056,8 @@ Perl_yylex(pTHX)
 			    while (t < PL_bufend && *t != ']')
 				t++;
 			    Perl_warner(aTHX_ packWARN(WARN_SYNTAX),
-					"Multidimensional syntax %.*s not supported",
-				    (int)((t - PL_bufptr) + 1), PL_bufptr);
+					"Multidimensional syntax %"UTF8f" not supported",
+                                        UTF8fARG(UTF,(int)((t - PL_bufptr) + 1), PL_bufptr));
 			}
 		    }
 		}
-- 
2.3.4


@p5pRT p5pRT commented Mar 31, 2015

From @iabyn

On Tue, Mar 31, 2015 at 04​:15​:48AM -0400, Alex Vandiver wrote​:

On Mon, 30 Mar 2015 19​:38​:24 +0100 Nicholas Clark <nick@​ccl4.org> wrote​:

I can consistently see the failures under t/harness on dromedary with
LC_ALL=en_US.utf8

I ran a bisect as​:

LC_ALL=en_US.UTF-8 PERL_UNICODE="" perl Porting/bisect.pl --start v5.21.10 --target lib/warnings.t

The test failure requires PERL_UNICODE="", and uncovers a warning which
was missing a UTF8fARG(). A little more poking around uncovered a
couple more as well; patch attached. The fact that the wide character
is reported "in print" and not "in warn" is likely its own bug.

Ah, I always forget the PERL_UNICODE="" bit.

Thanks, applied as v5.21.10-49-gb59c097.

--
But Pity stayed his hand. "It's a pity I've run out of bullets",
he thought. -- "Bored of the Rings"


@p5pRT p5pRT commented Feb 26, 2016

From @mauke

On Tue Mar 31 01​:51​:07 2015, davem wrote​:

On Tue, Mar 31, 2015 at 04​:15​:48AM -0400, Alex Vandiver wrote​:

On Mon, 30 Mar 2015 19​:38​:24 +0100 Nicholas Clark <nick@​ccl4.org>
wrote​:

I can consistently see the failures under t/harness on dromedary
with
LC_ALL=en_US.utf8

I ran a bisect as​:

LC_ALL=en_US.UTF-8 PERL_UNICODE="" perl Porting/bisect.pl --start
v5.21.10 --target lib/warnings.t

The test failure requires PERL_UNICODE="", and uncovers a warning
which
was missing a UTF8fARG(). A little more poking around uncovered a
couple more as well; patch attached. The fact that the wide
character
is reported "in print" and not "in warn" is likely its own bug.

Ah, I always forget the PERL_UNICODE="" bit.

Thanks, applied as v5.21.10-49-gb59c097.

Can this ticket be closed? It's listed in perl5220delta.


@p5pRT p5pRT commented Feb 29, 2016

From @iabyn

On Fri, Feb 26, 2016 at 11​:03​:17AM -0800, l.mai@​web.de via RT wrote​:

On Tue Mar 31 01​:51​:07 2015, davem wrote​:

On Tue, Mar 31, 2015 at 04​:15​:48AM -0400, Alex Vandiver wrote​:

On Mon, 30 Mar 2015 19​:38​:24 +0100 Nicholas Clark <nick@​ccl4.org>
wrote​:

I can consistently see the failures under t/harness on dromedary
with
LC_ALL=en_US.utf8

I ran a bisect as​:

LC_ALL=en_US.UTF-8 PERL_UNICODE="" perl Porting/bisect.pl --start
v5.21.10 --target lib/warnings.t

The test failure requires PERL_UNICODE="", and uncovers a warning
which
was missing a UTF8fARG(). A little more poking around uncovered a
couple more as well; patch attached. The fact that the wide
character
is reported "in print" and not "in warn" is likely its own bug.

Ah, I always forget the PERL_UNICODE="" bit.

Thanks, applied as v5.21.10-49-gb59c097.

Can this ticket be closed? It's listed in perl5220delta.

There was one un-applied patch still in the ticket​:

  0004-Adjust-callsites-that-use-UTF8SKIP-without-checking-.patch

Which I've just applied, as v5.23.8-35-g9538abe, so the ticket can be
closed.

Originally FC was reluctant to apply it since there weren't any tests
and it was near 5.22.0 release. However, looking at it more closely,
I don't think it fixes a bug or changes behaviour.

In more detail, it has three changes like:

  while (t < PL_bufend && isWORDCHAR_lazy_if(t,UTF))
  - t += UTF8SKIP(t);
  + t += UTF ? UTF8SKIP(t) : 1;

but if UTF is false, then isWORDCHAR_lazy_if() will be false for any
byte >= 0x80, so UTF8SKIP wouldn't be called anyway. For bytes < 0x80,
UTF8SKIP returns 1. So there's no change in behaviour. However, in terms
of consistency with the rest of toke.c and for avoiding future bugs, it's
worth applying the change anyway.
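
The argument above can be checked exhaustively over all 256 byte values with a small C sketch. This is an illustration under assumptions, not perl's code: utf8_skip is a simplified stand-in for UTF8SKIP, and is_word_byte_non_utf models the statement that, with UTF false, isWORDCHAR_lazy_if() only accepts bytes below 0x80 (here: ASCII letters, digits and underscore in the C locale).

```c
#include <assert.h>   /* for the checks mentioned in the note below */
#include <ctype.h>
#include <stddef.h>

/* Simplified stand-in for UTF8SKIP: sequence length implied by a lead byte. */
static size_t utf8_skip(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Assumed model of the word-character test when UTF is false. */
static int is_word_byte_non_utf(unsigned char b)
{
    return b < 0x80 && (isalnum(b) || b == '_');
}

/* Returns 1 if, for every byte the non-UTF loop would accept, the old step
 * (UTF8SKIP(t)) equals the new step (UTF ? UTF8SKIP(t) : 1, i.e. 1). */
static int steps_always_agree(void)
{
    for (int b = 0; b < 256; b++) {
        unsigned char byte = (unsigned char)b;

        if (!is_word_byte_non_utf(byte))
            continue;                   /* loop stops here; no step taken */
        if (utf8_skip(byte) != 1)
            return 0;
    }
    return 1;
}
```

Under these assumptions steps_always_agree() returns 1, so a caller could write assert(steps_always_agree()); which matches the conclusion that the patch does not change behaviour when UTF is false.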

--
Red sky at night - gerroff my land!
Red sky at morning - gerroff my land!
  -- old farmers' sayings #14


@p5pRT p5pRT commented Feb 29, 2016

@iabyn - Status changed from 'open' to 'resolved'
