Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perl mess with UTF8 flag of GV #15971

Closed
p5pRT opened this issue May 6, 2017 · 10 comments
Closed

Perl mess with UTF8 flag of GV #15971

p5pRT opened this issue May 6, 2017 · 10 comments
Labels

Comments

@p5pRT
Copy link
Collaborator

@p5pRT p5pRT commented May 6, 2017

Migrated from rt.perl.org#131263 (status was 'resolved')

Searchable as RT131263$

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented May 6, 2017

From @pali

It looks like that perl bleed mess with UTF8 flag of some GV.

When into some *glob is assigned scalar with UTF8 flag and later to same
*glob is assigned scalar without UTF8 flag, then UTF8 flag stay set in
*glob.

Look at following example​:

$str = "\N{U+0080}";
*sym = $str;
(*sym eq "*main​::\N{U+0080}") ? print "ok\n" : print "fail​: " .
  (join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";

$str = "\xC3\x80";
*sym = $str;
(*sym eq "*main​::\xC3\x80") ? print "ok\n" : print "fail​: " .
  (join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";

Its output is​:

ok
fail​: 2a 6d 61 69 6e 3a 3a c0

"*main​::\xC0"

U+00C0 is encoded in UTF-8 as 0xC3, 0x80 which means that *sym after
second assignment has UTF8 flag, even data (from $str) were assigned
without UTF8 flag. And so perl mess with UTF8 flag.

Calling Dump(*sym) or utf8​::is_utf8(*sym) prove this fact that UTF8 flag
is really set.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented May 7, 2017

From @cpansprout

On Sat, 06 May 2017 16​:45​:53 -0700, pali@​cpan.org wrote​:

It looks like that perl bleed mess with UTF8 flag of some GV.

When into some *glob is assigned scalar with UTF8 flag and later to same
*glob is assigned scalar without UTF8 flag, then UTF8 flag stay set in
*glob.

Look at following example​:

$str = "\N{U+0080}";
*sym = $str;
(*sym eq "*main​::\N{U+0080}") ? print "ok\n" : print "fail​: " .
(join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";

$str = "\xC3\x80";
*sym = $str;
(*sym eq "*main​::\xC3\x80") ? print "ok\n" : print "fail​: " .
(join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";

Its output is​:

ok
fail​: 2a 6d 61 69 6e 3a 3a c0

"*main​::\xC0"

U+00C0 is encoded in UTF-8 as 0xC3, 0x80 which means that *sym after
second assignment has UTF8 flag, even data (from $str) were assigned
without UTF8 flag. And so perl mess with UTF8 flag.

Calling Dump(*sym) or utf8​::is_utf8(*sym) prove this fact that UTF8 flag
is really set.

Using this shorter example​:

#!perl
*sym = "\N{U+0080}";
*sym eq "*main​::\N{U+0080}";

*sym = "\xC3\x80";
(*sym eq "*main​::\xC3\x80") ? print "ok\n" : print "fail​: " .
  (join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";
__END__

If you comment out either one of the first two lines, it prints ‘ok’, so the ‘eq’ comparison has something to do with it.

--

Father Chrysostomos

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented May 7, 2017

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 5, 2017

From @tonycoz

On Sun, 07 May 2017 06​:45​:31 -0700, sprout wrote​:

On Sat, 06 May 2017 16​:45​:53 -0700, pali@​cpan.org wrote​:

It looks like that perl bleed mess with UTF8 flag of some GV.

When into some *glob is assigned scalar with UTF8 flag and later to
same
*glob is assigned scalar without UTF8 flag, then UTF8 flag stay set
in
*glob.

Look at following example​:

$str = "\N{U+0080}";
*sym = $str;
(*sym eq "*main​::\N{U+0080}") ? print "ok\n" : print "fail​: " .
(join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";

$str = "\xC3\x80";
*sym = $str;
(*sym eq "*main​::\xC3\x80") ? print "ok\n" : print "fail​: " .
(join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";

Its output is​:

ok
fail​: 2a 6d 61 69 6e 3a 3a c0

"*main​::\xC0"

U+00C0 is encoded in UTF-8 as 0xC3, 0x80 which means that *sym after
second assignment has UTF8 flag, even data (from $str) were assigned
without UTF8 flag. And so perl mess with UTF8 flag.

Calling Dump(*sym) or utf8​::is_utf8(*sym) prove this fact that UTF8
flag
is really set.

Using this shorter example​:

#!perl
*sym = "\N{U+0080}";
*sym eq "*main​::\N{U+0080}";

*sym = "\xC3\x80";
(*sym eq "*main​::\xC3\x80") ? print "ok\n" : print "fail​: " .
(join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";
__END__

If you comment out either one of the first two lines, it prints ‘ok’,
so the ‘eq’ comparison has something to do with it.

sv_2pv_flags() would set UTF8 on, but never turned it off.

Fixed by the attached.

Tony

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 5, 2017

From @tonycoz

0001-perl-131263-clear-the-UTF8-flag-on-a-glob-if-it-isn-.patch
From fc661be6f8ba0ef072eb3874d6d7cfbe11329781 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Mon, 5 Jun 2017 14:38:14 +1000
Subject: [PATCH] [perl #131263] clear the UTF8 flag on a glob if it isn't UTF8

Previously sv_2pv_flags() would set the UTF8 flag on a glob if it
had a UTF8 name, but wouldn't clear it if it didn't.

This meant a name change (eg. if assigned another glob) from a UTF8
name to a non-UTF8 name would leave the flag set.
---
 sv.c      |  2 ++
 t/op/gv.t | 10 +++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/sv.c b/sv.c
index db51175..3296519 100644
--- a/sv.c
+++ b/sv.c
@@ -3181,6 +3181,8 @@ Perl_sv_2pv_flags(pTHX_ SV *const sv, STRLEN *const lp, const I32 flags)
 	assert(SvPOK(buffer));
 	if (SvUTF8(buffer))
 	    SvUTF8_on(sv);
+        else
+            SvUTF8_off(sv);
 	if (lp)
 	    *lp = SvCUR(buffer);
 	return SvPVX(buffer);
diff --git a/t/op/gv.t b/t/op/gv.t
index be6b0dd..4c2c43f 100644
--- a/t/op/gv.t
+++ b/t/op/gv.t
@@ -12,7 +12,7 @@ BEGIN {
 
 use warnings;
 
-plan(tests => 280);
+plan(tests => 282);
 
 # type coercion on assignment
 $foo = 'foo';
@@ -1170,6 +1170,14 @@ SKIP: {
     is ($? & 127, 0,"[perl #128597] No crash when gp_free calls ckWARN_d");
 }
 
+{
+    # [perl #131263]
+    *sym = "\N{U+0080}";
+    ok(*sym eq "*main::\N{U+0080}", "utf8 flag properly set");
+    *sym = "\xC3\x80";
+    ok(*sym eq "*main::\xC3\x80", "utf8 flag properly cleared");
+}
+
 # test gv_try_downgrade()
 # If a GV can be stored in a stash in a compact, non-GV form, then
 # whenever ops are freed which reference the GV, an attempt is made to
-- 
2.1.4

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 5, 2017

From @jkeenan

On Mon, 05 Jun 2017 04​:39​:12 GMT, tonyc wrote​:

On Sun, 07 May 2017 06​:45​:31 -0700, sprout wrote​:

On Sat, 06 May 2017 16​:45​:53 -0700, pali@​cpan.org wrote​:

It looks like that perl bleed mess with UTF8 flag of some GV.

When into some *glob is assigned scalar with UTF8 flag and later to
same
*glob is assigned scalar without UTF8 flag, then UTF8 flag stay set
in
*glob.

Look at following example​:

$str = "\N{U+0080}";
*sym = $str;
(*sym eq "*main​::\N{U+0080}") ? print "ok\n" : print "fail​: " .
(join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";

$str = "\xC3\x80";
*sym = $str;
(*sym eq "*main​::\xC3\x80") ? print "ok\n" : print "fail​: " .
(join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";

Its output is​:

ok
fail​: 2a 6d 61 69 6e 3a 3a c0

"*main​::\xC0"

U+00C0 is encoded in UTF-8 as 0xC3, 0x80 which means that *sym after
second assignment has UTF8 flag, even data (from $str) were assigned
without UTF8 flag. And so perl mess with UTF8 flag.

Calling Dump(*sym) or utf8​::is_utf8(*sym) prove this fact that UTF8
flag
is really set.

Using this shorter example​:

#!perl
*sym = "\N{U+0080}";
*sym eq "*main​::\N{U+0080}";

*sym = "\xC3\x80";
(*sym eq "*main​::\xC3\x80") ? print "ok\n" : print "fail​: " .
(join " ", map { sprintf "%x", ord($_) } split //, *sym) . "\n";
__END__

If you comment out either one of the first two lines, it prints ‘ok’,
so the ‘eq’ comparison has something to do with it.

sv_2pv_flags() would set UTF8 on, but never turned it off.

Fixed by the attached.

Tony

Tested (in TDD manner) on Linux and two FreeBSD VMs. DWIMs. +1

--
James E Keenan (jkeenan@​cpan.org)

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 7, 2017

From @tonycoz

On Mon, 05 Jun 2017 05​:58​:28 -0700, jkeenan wrote​:

Tested (in TDD manner) on Linux and two FreeBSD VMs. DWIMs. +1

Thanks for testing it, applied as 1097da1.

Tony

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 7, 2017

@tonycoz - Status changed from 'open' to 'pending release'

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 23, 2018

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release yesterday of Perl 5.28.0, this and 185 other issues have been
resolved.

Perl 5.28.0 may be downloaded via​:
https://metacpan.org/release/XSAWYERX/perl-5.28.0

If you find that the problem persists, feel free to reopen this ticket.

@p5pRT p5pRT closed this Jun 23, 2018
@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 23, 2018

@khwilliamson - Status changed from 'pending release' to 'resolved'

@p5pRT p5pRT added the Severity Low label Oct 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant
You can’t perform that action at this time.