Compile-time warning with UTF8 variable in array index #14601
From alex@chmrr.net
Created by alex@chmrr.net
Perl 5.18.0 and above cause compile-time warnings with: ..but not:
I've attached a test.pl file containing the above, in case the U+1D6C3
Passing malformed UTF-8 to "XPosixWord" is deprecated at test.pl line 13.
Bisect points to: commit 2812354 Deprecate calling isFOO_utf8() with malformed
Perl Info

From alex@chmrr.net

From alex@chmrr.net
On Wed, 18 Mar 2015 17:12:38 -0700 Alex Vandiver (via RT)
Patches for this, as well as 2.5 related problems, attached.

From alex@chmrr.net
0001-perl-124113-Make-check-for-multi-dimensional-arrays-.patch
From dd02e43995a74db7c64092864bfa894ea3b2a576 Mon Sep 17 00:00:00 2001
From: Alex Vandiver <alex@chmrr.net>
Date: Sun, 22 Mar 2015 22:39:23 -0400
Subject: [PATCH 1/4] [perl #124113] Make check for multi-dimensional arrays be
UTF8-aware
During parsing, toke.c checks if the user is attempting to provide multiple
indexes to an array index:
    $a[ $foo, $bar ];
However, while checking for word characters in variable names is aware
of multi-byte characters if "use utf8" is enabled, the loop is only
advanced one byte at a time, not one character at a time. As such,
multibyte variables in array indexes incorrectly yield warnings:
    Passing malformed UTF-8 to "XPosixWord" is deprecated
    Malformed UTF-8 character (unexpected continuation byte 0x9d, with
    no preceding start byte)
Switch the loop to advance character-by-character if UTF-8 semantics are
in use.
---
t/lib/warnings/toke | 10 ++++++++++
toke.c | 2 +-
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke
index 5d31104..018f188 100644
--- a/t/lib/warnings/toke
+++ b/t/lib/warnings/toke
@@ -1521,3 +1521,13 @@ Use of literal control characters in variable names is deprecated at (eval 2) li
-a;
;-a;
EXPECT
+########
+# toke.c
+# [perl #124113] Compile-time warning with UTF8 variable in array index
+use warnings;
+use utf8;
+my $𝛃 = 0;
+my @array = (0);
+my $v = $array[ 0 + $𝛃 ];
+ $v = $array[ $𝛃 + 0 ];
+EXPECT
diff --git a/toke.c b/toke.c
index 610db62..50eb89b 100644
--- a/toke.c
+++ b/toke.c
@@ -6049,7 +6049,7 @@ Perl_yylex(pTHX)
char *t = s+1;
while (isSPACE(*t) || isWORDCHAR_lazy_if(t,UTF) || *t == '$')
- t++;
+ t += UTF ? UTF8SKIP(t) : 1;
if (*t++ == ',') {
PL_bufptr = skipspace(PL_bufptr); /* XXX can realloc */
while (t < PL_bufend && *t != ']')
--
2.3.3
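The byte-versus-character stepping described in the patch above can be illustrated outside of perl with a small standalone sketch. It does not use Perl's internals: utf8_seq_len is a hypothetical stand-in for UTF8SKIP, the utf8 flag stands in for the UTF check, and continuation_landings is an illustrative helper that counts how often a scan position lands on a UTF-8 continuation byte, which is the situation that triggers the "unexpected continuation byte" warning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8SKIP: byte length of the UTF-8 sequence
 * introduced by lead byte b. Returns 1 for ASCII and for stray
 * continuation or invalid bytes, so a scan always makes progress. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Walk len bytes of s and count how many times the scan position lands
 * on a continuation byte (10xxxxxx). With utf8 == 0 the scan advances
 * one byte at a time; otherwise it advances one character at a time. */
size_t continuation_landings(const char *s, size_t len, int utf8)
{
    const char *end = s + len;
    size_t hits = 0;
    while (s < end) {
        unsigned char b = (unsigned char)*s;
        size_t step = utf8 ? utf8_seq_len(b) : 1;
        if ((b & 0xC0) == 0x80)
            hits++;
        if (step > (size_t)(end - s))   /* never step past the buffer */
            step = (size_t)(end - s);
        s += step;
    }
    return hits;
}

/* Self-check on "$" followed by U+1D6C3 (bytes f0 9d 9b 83). */
void check_scan(void)
{
    const char var[] = "$\xF0\x9D\x9B\x83";
    assert(continuation_landings(var, 5, 0) == 3); /* lands on 9d, 9b, 83 */
    assert(continuation_landings(var, 5, 1) == 0); /* lands on none */
}
```

In this sketch the first continuation byte a byte-wise scan lands on is 0x9d, the same byte named in the warning quoted in the commit message.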

From alex@chmrr.net
0002-Allow-unquoted-UTF-8-HERE-document-terminators.patch
From 21c279c0ea6a8425e3121de8670e10874c1cf5b8 Mon Sep 17 00:00:00 2001
From: Alex Vandiver <alex@chmrr.net>
Date: Sun, 22 Mar 2015 22:45:54 -0400
Subject: [PATCH 2/4] Allow unquoted UTF-8 HERE-document terminators
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When not explicitly quoted, tokenization of the HERE-document terminator
dealt improperly with multi-byte characters, advancing one byte at a
time instead of one character at a time. This led to
incomprehensible-to-the-user errors of the form:
    Passing malformed UTF-8 to "XPosixWord" is deprecated
    Malformed UTF-8 character (unexpected continuation byte 0xa7, with
    no preceding start byte)
    Can't find string terminator "EnFra�" anywhere before EOF
If enclosed in single or double quotes, parsing was correctly effected,
as delimcpy advances byte-by-byte, but looks only for the single-byte
ending character.
When doing a \w+ match looking for the end of the word, advance
character-by-character instead of byte-by-byte, ensuring that the size
does not extend past the available size in PL_tokenbuf.
---
t/lib/warnings/toke | 11 +++++++++++
toke.c | 10 +++++++---
2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke
index 018f188..b1d5347 100644
--- a/t/lib/warnings/toke
+++ b/t/lib/warnings/toke
@@ -1531,3 +1531,14 @@ my @array = (0);
my $v = $array[ 0 + $𝛃 ];
$v = $array[ $𝛃 + 0 ];
EXPECT
+########
+# toke.c
+# Allow Unicode here doc boundaries
+use warnings;
+use utf8;
+my $v = <<EnFraçais;
+Comme ca!
+EnFraçais
+print $v;
+EXPECT
+Comme ca!
diff --git a/toke.c b/toke.c
index 50eb89b..d81689f 100644
--- a/toke.c
+++ b/toke.c
@@ -9210,10 +9210,14 @@ S_scan_heredoc(pTHX_ char *s)
term = '"';
if (!isWORDCHAR_lazy_if(s,UTF))
deprecate("bare << to mean <<\"\"");
- for (; isWORDCHAR_lazy_if(s,UTF); s++) {
- if (d < e)
- *d++ = *s;
+ peek = s;
+ while (isWORDCHAR_lazy_if(peek,UTF)) {
+ peek += UTF ? UTF8SKIP(peek) : 1;
}
+ len = (peek - s >= e - d) ? (e - d) : (peek - s);
+ Copy(s, d, len, char);
+ s += len;
+ d += len;
}
if (d >= PL_tokenbuf + sizeof PL_tokenbuf - 1)
Perl_croak(aTHX_ "Delimiter for here document is too long");
--
2.3.3
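The scan-then-clamp approach of the patch above (find the end of the word character-by-character, then copy at most as many bytes as the buffer can hold) can be sketched in standalone C. Everything here is an assumption made for illustration: lead_byte_len stands in for UTF8SKIP, is_word_start is a much cruder word test than isWORDCHAR_lazy_if (it accepts ASCII alphanumerics, underscore, and any byte of 0xC0 or above), and copy_terminator is a hypothetical helper, not a function in toke.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for UTF8SKIP. */
size_t lead_byte_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Crude word test (an assumption, far simpler than Perl's): ASCII
 * letters, digits, underscore, or any byte that can start a multi-byte
 * sequence. */
int is_word_start(unsigned char b)
{
    return isalnum(b) || b == '_' || b >= 0xC0;
}

/* Copy the leading word of src into dst, which has room for cap bytes
 * including the terminating NUL. The word is measured one character at
 * a time; the copy is then clamped to the space available.
 * Returns the number of bytes copied. */
size_t copy_terminator(const char *src, char *dst, size_t cap)
{
    size_t src_len = strlen(src);
    size_t peek = 0;
    while (peek < src_len && is_word_start((unsigned char)src[peek])) {
        size_t step = lead_byte_len((unsigned char)src[peek]);
        if (step > src_len - peek)      /* truncated final sequence */
            step = src_len - peek;
        peek += step;
    }
    size_t room = cap ? cap - 1 : 0;
    size_t len = peek < room ? peek : room;
    memcpy(dst, src, len);
    if (cap)
        dst[len] = '\0';
    return len;
}

/* Self-check using the terminator from the test case above. */
void check_copy(void)
{
    char buf[32];
    /* The literal is split so that "\xA7" is not read as "\xA7a". */
    size_t n = copy_terminator("EnFra\xC3\xA7" "ais;", buf, sizeof buf);
    assert(n == 10);
    assert(strcmp(buf, "EnFra\xC3\xA7" "ais") == 0);
}
```

As in the patch, a clamp at the buffer limit can cut a multi-byte character in two; in toke.c an over-long delimiter is rejected immediately afterwards with "Delimiter for here document is too long", so the truncated copy is never used there.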

From alex@chmrr.net
0003-Fix-.without-parentheses-is-ambuguous-warning-for-UT.patch
From 93439761a29d0ae39c6716814249e2c075562522 Mon Sep 17 00:00:00 2001
From: Alex Vandiver <alex@chmrr.net>
Date: Sun, 22 Mar 2015 23:08:24 -0400
Subject: [PATCH 3/4] Fix "...without parentheses is ambuguous" warning for
UTF-8 function names
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
While isWORDCHAR_lazy_if is UTF-8 aware, checking advanced byte-by-byte.
This led to errors of the form:
    Passing malformed UTF-8 to "XPosixWord" is deprecated
    Malformed UTF-8 character (unexpected continuation byte 0x9d, with
    no preceding start byte)
    Warning: Use of "�" without parentheses is ambiguous
Use UTF8SKIP to advance character-by-character, not byte-by-byte.
---
t/lib/warnings/toke | 10 ++++++++++
toke.c | 2 +-
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke
index b1d5347..6cbce2e 100644
--- a/t/lib/warnings/toke
+++ b/t/lib/warnings/toke
@@ -1542,3 +1542,13 @@ EnFraçais
print $v;
EXPECT
Comme ca!
+########
+# toke.c
+# Fix 'Use of "..." without parentheses is ambiguous' warning for
+# Unicode function names
+use utf8;
+use warnings;
+sub 𝛃(;$) { return 0; }
+my $v = 𝛃 - 5;
+EXPECT
+Warning: Use of "𝛃" without parentheses is ambiguous at - line 7.
diff --git a/toke.c b/toke.c
index d81689f..338e1fd 100644
--- a/toke.c
+++ b/toke.c
@@ -1841,7 +1841,7 @@ S_check_uni(pTHX)
PL_last_uni++;
s = PL_last_uni;
while (isWORDCHAR_lazy_if(s,UTF) || *s == '-')
- s++;
+ s += UTF ? UTF8SKIP(s) : 1;
if ((t = strchr(s, '(')) && t < PL_bufptr)
return;
--
2.3.3

From alex@chmrr.net
0004-Adjust-callsites-that-use-UTF8SKIP-without-checking-.patch
From 932f135e5936c3bcc57382dc44dbfcd7225600df Mon Sep 17 00:00:00 2001
From: Alex Vandiver <alex@chmrr.net>
Date: Sun, 22 Mar 2015 23:44:11 -0400
Subject: [PATCH 4/4] Adjust callsites that use UTF8SKIP without checking UTF
Assuming UTF-8 semantics and advancing character-by-character when 'use
utf8' is not enabled is not as problematic as the inverse. However,
strictly speaking, UTF8SKIP should only be used when UTF-8 semantics are
explicitly requested.
Change the three occurrences of UTF8SKIP that are not protected by UTF
checks.
---
toke.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/toke.c b/toke.c
index 338e1fd..317dbcc 100644
--- a/toke.c
+++ b/toke.c
@@ -5648,12 +5648,12 @@ Perl_yylex(pTHX)
else
/* skip plain q word */
while (t < PL_bufend && isWORDCHAR_lazy_if(t,UTF))
- t += UTF8SKIP(t);
+ t += UTF ? UTF8SKIP(t) : 1;
}
else if (isWORDCHAR_lazy_if(t,UTF)) {
- t += UTF8SKIP(t);
+ t += UTF ? UTF8SKIP(t) : 1;
while (t < PL_bufend && isWORDCHAR_lazy_if(t,UTF))
- t += UTF8SKIP(t);
+ t += UTF ? UTF8SKIP(t) : 1;
}
while (t < PL_bufend && isSPACE(*t))
t++;
--
2.3.3

From @cpansprout
On Sun Mar 22 21:12:43 2015, alex@chmrr.net wrote:
Thank you. I have applied the first three patches:
$ git log --oneline -3
I don’t like the fact that the fourth one lacks tests, even though I believe it corrects the behaviour. Also, what it addresses is not a regression of any sort, so it needs to wait until after 5.22 since we are in code freeze.
--
Father Chrysostomos

The RT System itself - Status changed from 'new' to 'open'

From @iabyn
On Fri, Mar 27, 2015 at 01:17:58PM -0700, Father Chrysostomos via RT wrote:
This one is intermittently failing smokes. The test is:
# toke.c
Run by hand, this (correctly) gives me:
    Warning: Use of "𝛃" without parentheses is ambiguous at /tmp/p line 7.
'od' shows that the bytes that make up the beta in the src are:
    f0 9d 9b 83
(i.e. codepoint \x{1d6c3}) and that the bytes output for the beta in the warning message when run
    f0 9d 9b 83
According to George Greer's smoke log,
http://m-l.org/~perl/smoke/perl/linux/blead_g++/log38f18a308b948c6eaf187519a16d060e1ec7cc20.log.gz
The output is:
EXPECTED:
where the bytes that make up AAAA and BBBB are:
    c3 b0 c2 9d c2 9b c2 83
    c3 83 c2 b0 c3 82 c2 9d c3 82 c2 9b c3 82 c2 83
AAAA is the original bytes double-encoded, while BBBB is triple-encoded.
I guess that one extra level of encoding is caused by the smoker code when
The smokes seem to only fail for the permutations with LC_ALL=en_US.utf8.
--
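The double- and triple-encoded byte sequences reported above can be reproduced with a short standalone sketch. It assumes the usual mechanism behind such output: bytes that are already UTF-8 get treated as Latin-1 code points and are encoded to UTF-8 again. The function name reencode_as_utf8 is hypothetical and is not part of perl or the smoker.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Treat each of the n input bytes as a Latin-1 code point and write its
 * UTF-8 encoding to out, which must have room for 2 * n bytes.
 * Returns the number of bytes written. */
size_t reencode_as_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];                              /* ASCII as-is */
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}

/* Self-check against the bytes quoted in the report. */
void check_reencode(void)
{
    const unsigned char beta[4] = { 0xF0, 0x9D, 0x9B, 0x83 }; /* U+1D6C3 */
    const unsigned char aaaa[8] = { 0xC3, 0xB0, 0xC2, 0x9D,
                                    0xC2, 0x9B, 0xC2, 0x83 };
    const unsigned char bbbb[16] = { 0xC3, 0x83, 0xC2, 0xB0, 0xC3, 0x82,
                                     0xC2, 0x9D, 0xC3, 0x82, 0xC2, 0x9B,
                                     0xC3, 0x82, 0xC2, 0x83 };
    unsigned char once[8], twice[16];

    assert(reencode_as_utf8(beta, 4, once) == 8);
    assert(memcmp(once, aaaa, 8) == 0);     /* AAAA: double-encoded */
    assert(reencode_as_utf8(once, 8, twice) == 16);
    assert(memcmp(twice, bbbb, 16) == 0);   /* BBBB: triple-encoded */
}
```

One pass over f0 9d 9b 83 yields the AAAA bytes and a second pass yields the BBBB bytes, which is consistent with the description of AAAA as double-encoded and BBBB as triple-encoded.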

From alex@chmrr.net
On Mon, 30 Mar 2015 12:00:26 +0100 Dave Mitchell <davem@iabyn.com>
Thanks for the note -- I'll take a closer look tonight.
- Alex

From @nwc10
On Mon, Mar 30, 2015 at 12:00:26PM +0100, Dave Mitchell wrote:
I can consistently see the failures under t/harness on dromedary with
I ran a bisect as:
    LC_ALL=en_US.UTF-8 PERL_UNICODE="" perl Porting/bisect.pl --start v5.21.10 --target lib/warnings.t
and it reports that the errors start at this commit:
    commit 8ce2ba8 Fix "...without parentheses is ambuguous" warning for UTF-8 function names
(and by implication they are not a side effect of a later commit)
I'm not in a position to investigate further as to why, let alone provide a
Nicholas Clark

From alex@chmrr.net
On Mon, 30 Mar 2015 19:38:24 +0100 Nicholas Clark <nick@ccl4.org> wrote:
The test failure requires PERL_UNICODE="", and uncovers a warning which

From alex@chmrr.net
0001-toke.c-UTF-8-aware-warning-cleanups.patch
From 3b98ad2da63a7f9c25d3b5b1063d3787e0b6790a Mon Sep 17 00:00:00 2001
From: Alex Vandiver <alex@chmrr.net>
Date: Tue, 31 Mar 2015 03:46:41 -0400
Subject: [PATCH] toke.c: UTF-8 aware warning cleanups
---
t/lib/warnings/toke | 6 ++++--
toke.c | 13 +++++++------
2 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/t/lib/warnings/toke b/t/lib/warnings/toke
index 6cbce2e..dab8451 100644
--- a/t/lib/warnings/toke
+++ b/t/lib/warnings/toke
@@ -1545,10 +1545,12 @@ Comme ca!
########
# toke.c
# Fix 'Use of "..." without parentheses is ambiguous' warning for
-# Unicode function names
+# Unicode function names. If not under PERL_UNICODE, this will generate
+# a "Wide character" warning
use utf8;
use warnings;
sub 𝛃(;$) { return 0; }
my $v = 𝛃 - 5;
EXPECT
-Warning: Use of "𝛃" without parentheses is ambiguous at - line 7.
+OPTION regex
+(Wide character.*\n)?Warning: Use of "𝛃" without parentheses is ambiguous
diff --git a/toke.c b/toke.c
index a115c28..ed7967c 100644
--- a/toke.c
+++ b/toke.c
@@ -1846,8 +1846,8 @@ S_check_uni(pTHX)
return;
Perl_ck_warner_d(aTHX_ packWARN(WARN_AMBIGUOUS),
- "Warning: Use of \"%.*s\" without parentheses is ambiguous",
- (int)(s - PL_last_uni), PL_last_uni);
+ "Warning: Use of \"%"UTF8f"\" without parentheses is ambiguous",
+ UTF8fARG(UTF, (int)(s - PL_last_uni), PL_last_uni));
}
/*
@@ -2529,9 +2529,10 @@ S_get_and_check_backslash_N_name(pTHX_ const char* s, const char* const e)
/* We deliberately don't try to print the malformed character, which
* might not print very well; it also may be just the first of many
* malformations, so don't print what comes after it */
- yyerror(Perl_form(aTHX_
+ yyerror_pv(Perl_form(aTHX_
"Malformed UTF-8 character immediately after '%.*s'",
- (int) (first_bad_char_loc - (U8 *) backslash_ptr), backslash_ptr));
+ (int) (first_bad_char_loc - (U8 *) backslash_ptr), backslash_ptr),
+ SVf_UTF8);
return NULL;
}
@@ -6055,8 +6056,8 @@ Perl_yylex(pTHX)
while (t < PL_bufend && *t != ']')
t++;
Perl_warner(aTHX_ packWARN(WARN_SYNTAX),
- "Multidimensional syntax %.*s not supported",
- (int)((t - PL_bufptr) + 1), PL_bufptr);
+ "Multidimensional syntax %"UTF8f" not supported",
+ UTF8fARG(UTF,(int)((t - PL_bufptr) + 1), PL_bufptr));
}
}
}
--
2.3.4

From @iabyn
On Tue, Mar 31, 2015 at 04:15:48AM -0400, Alex Vandiver wrote:
Ah, I always forget the PERL_UNICODE="" bit. Thanks, applied as v5.21.10-49-gb59c097.
--

From @mauke
On Tue Mar 31 01:51:07 2015, davem wrote:
Can this ticket be closed? It's listed in perl5220delta.

From @iabyn
On Fri, Feb 26, 2016 at 11:03:17AM -0800, l.mai@web.de via RT wrote:
There was one un-applied patch still in the ticket:
    0004-Adjust-callsites-that-use-UTF8SKIP-without-checking-.patch
Which I've just applied, as v5.23.8-35-g9538abe, so the ticket can be
Originally FC was reluctant to apply it since there weren't any tests
In more detail, it has three changes like:
    while (t < PL_bufend && isWORDCHAR_lazy_if(t,UTF))
but if UTF is false, then isWORDCHAR_lazy_if() will be false for any
--

@iabyn - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#124113 (status was 'resolved')
Searchable as RT124113$