dump.c cannot dump Unicode stash names #11762

Closed
p5pRT opened this issue Nov 21, 2011 · 12 comments

Comments

@p5pRT p5pRT commented Nov 21, 2011

Migrated from rt.perl.org#104116 (status was 'resolved')

Searchable as RT104116$

@p5pRT p5pRT commented Nov 21, 2011

From @cpansprout

$ ./perl -Ilib -Mutf8 -MDevel::Peek -e '*fòò:: = *bǎr::; Dump \%fòò::'
SV = IV(0x8666cc) at 0x8666d0
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x867e40
  SV = PVHV(0x808330) at 0x867e40
  REFCNT = 2
  FLAGS = (OOK,SHAREKEYS)
  ARRAY = 0x28c610
  KEYS = 0
  FILL = 0
  MAX = 7
  RITER = -1
  EITER = 0x0
  NAME = "bǎr"
  NAMECOUNT = 2
  ENAME = "bǎr", "f??"

The question marks in "f??" stand for Latin-1 bytes that my UTF-8 terminal could not render; "bǎr", by contrast, is output in UTF-8.


Flags:
  category=core
  severity=low


Site configuration information for perl 5.15.4:

Configured by sprout at Wed Nov 2 09:06:14 PDT 2011.

Summary of my perl5 (revision 5 version 15 subversion 4) configuration:
  Snapshot of: f364061
  Platform:
  osname=darwin, osvers=10.5.0, archname=darwin-thread-multi-2level
  uname='darwin pint.local 10.5.0 darwin kernel version 10.5.0: fri nov 5 23:20:39 pdt 2010; root:xnu-1504.9.17~1release_i386 i386 '
  config_args='-de -Doptimize=-g -Dusedevel -Duseithreads -Dmad'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=define, usemultiplicity=define
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler:
  cc='cc', ccflags ='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
  optimize='-g',
  cppflags='-fno-common -DPERL_DARWIN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='4.2.1 (Apple Inc. build 5664)', gccosandvers=''
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries:
  ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags =' -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib
  libs=-ldbm -ldl -lm -lutil -lc
  perllibs=-ldl -lm -lutil -lc
  libc=, so=dylib, useshrplib=false, libperl=libperl.a
  gnulibc_version=''
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
  cccdlflags=' ', lddlflags=' -bundle -undefined dynamic_lookup -L/usr/local/lib -fstack-protector'

Locally applied patches:



@INC for perl 5.15.4:
  /usr/local/lib/perl5/site_perl/5.15.4/darwin-thread-multi-2level
  /usr/local/lib/perl5/site_perl/5.15.4
  /usr/local/lib/perl5/5.15.4/darwin-thread-multi-2level
  /usr/local/lib/perl5/5.15.4
  /usr/local/lib/perl5/site_perl
  .


Environment for perl 5.15.4:
  DYLD_LIBRARY_PATH (unset)
  HOME=/Users/sprout
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/local/bin
  PERL_BADLANG (unset)
  SHELL=/bin/bash

@p5pRT p5pRT commented Feb 3, 2012

From @Hugmeir

On Sun Nov 20 16:23:45 2011, sprout wrote:

$ ./perl -Ilib -Mutf8 -MDevel::Peek -e '*fòò:: = *bǎr::; Dump \%fòò::'
SV = IV(0x8666cc) at 0x8666d0
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x867e40
SV = PVHV(0x808330) at 0x867e40
REFCNT = 2
FLAGS = (OOK,SHAREKEYS)
ARRAY = 0x28c610
KEYS = 0
FILL = 0
MAX = 7
RITER = -1
EITER = 0x0
NAME = "bǎr"
NAMECOUNT = 2
ENAME = "bǎr", "f??"

Those question marks represent Latin-1 bytes that my UTF-8 terminal
could not render. bǎr is output in UTF-8.

Howdy all. I'm looking for opinions on how to go about fixing this.
Usually, when Devel::Peek finds something with the UTF-8 flag on, it'll
display it like this:
$ perl -MDevel::Peek -E 'Dump "\x{30cb}"'

SV = PV(0x90d8a94) at 0x90f5b4c
  REFCNT = 1
  FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
  PV = 0x90fdc74 "\343\203\213"\0 [UTF8 "\x{30cb}"]
  CUR = 3
  LEN = 12

That is, "escaped-bytestring"\0 [UTF8 "escaped-character-string"].
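
As a rough illustration of that convention, here is a small Python sketch (not the dump.c code; the function names are made up for this example, and the escaping is simplified to "octal for every non-printable byte, braced hex for every non-printable character"). It renders one string both as escaped UTF-8 bytes and as escaped characters:

```python
def escape_bytes_octal(raw: bytes) -> str:
    """Escape every byte outside printable ASCII as a 3-digit octal escape."""
    out = []
    for b in raw:
        ch = chr(b)
        if ch in '"\\':
            out.append("\\" + ch)       # keep the quoting unambiguous
        elif 0x20 <= b <= 0x7E:
            out.append(ch)              # printable ASCII passes through
        else:
            out.append("\\%03o" % b)
    return "".join(out)


def escape_chars_hex(text: str) -> str:
    """Escape every character outside printable ASCII as braced hex."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in '"\\':
            out.append("\\" + ch)
        elif 0x20 <= cp <= 0x7E:
            out.append(ch)
        else:
            out.append("\\x{%x}" % cp)
    return "".join(out)


def peek_pv(text: str) -> str:
    """Mimic the '"bytes"\\0 [UTF8 "chars"]' layout for a UTF-8 flagged string."""
    raw = text.encode("utf-8")
    return '"%s"\\0 [UTF8 "%s"]' % (escape_bytes_octal(raw), escape_chars_hex(text))


print(peek_pv("\u30cb"))   # "\343\203\213"\0 [UTF8 "\x{30cb}"]
```

The same code point shows up twice, once as three octal-escaped bytes and once as a single braced-hex character, which is the duplication the options below try to deal with.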

Should it also follow that convention for stash names and the like? Or
should it just show the escaped character string? Or neither of those,
and output raw UTF-8 when possible? Or something else entirely?

To get the point across, here's something like what the output would
look like for the three options:

$ perl -MDevel::Peek -E '*{"f\xe9::"} = *{"b\x{30cb}::"}; Dump \%{"f\xe9::"}'

First option
SV = IV(0x8d62b38) at 0x8d62b3c
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x8e278bc
  SV = PVHV(0x8d1b39c) at 0x8e278bc
  REFCNT = 2
  FLAGS = (OOK,SHAREKEYS)
  ARRAY = 0x8e2399c
  KEYS = 0
  FILL = 0
  MAX = 7
  RITER = -1
  EITER = 0x0
  NAME = "b\343\203\213" [UTF8 "b\x{30cb}"]
  NAMECOUNT = 2
  ENAME = "b\343\203\213" [UTF8 "b\x{30cb}"], "f\351"

Second option
SV = IV(0x8d62b38) at 0x8d62b3c
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x8e278bc
  SV = PVHV(0x8d1b39c) at 0x8e278bc
  REFCNT = 2
  FLAGS = (OOK,SHAREKEYS)
  ARRAY = 0x8e2399c
  KEYS = 0
  FILL = 0
  MAX = 7
  RITER = -1
  EITER = 0x0
  NAME = "b\x{30cb}"
  NAMECOUNT = 2
  ENAME = "b\x{30cb}", "f\351"

Third option
SV = IV(0x8d62b38) at 0x8d62b3c
  REFCNT = 1
  FLAGS = (TEMP,ROK)
  RV = 0x8e278bc
  SV = PVHV(0x8d1b39c) at 0x8e278bc
  REFCNT = 2
  FLAGS = (OOK,SHAREKEYS)
  ARRAY = 0x8e2399c
  KEYS = 0
  FILL = 0
  MAX = 7
  RITER = -1
  EITER = 0x0
  NAME = "bニ"
  NAMECOUNT = 2
  ENAME = "bニ", "fé"

Personally, I think the first option sucks -- it sort of starts alright
but breaks down easily for other types, like a coderef, which gets
printed like "STASH" :: "NAME", and would become "STASH" [UTF8
"STASH"] :: "NAME" [UTF8 "NAME"] if both of those were in UTF8. But
admittedly this is a stylistic concern more than anything.

Meanwhile, the third makes debugging dependent on having a font that
can display all symbols and on not getting anything invisible in your
names -- otherwise, good luck telling \xe9 from e\N{COMBINING ACUTE
ACCENT}, or working out why ref(bless {}, "Yep") ne ref(bless {}, "Ye\0p").

So I will probably go for the second, unless someone has objections and/
or a better idea.
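
To make the lookalike problem concrete, here is a hedged Python sketch (the helper `escape_name` is hypothetical and uses braced hex for everything outside printable ASCII, which is a simplification of the second option). The two pairs of names below can render identically in a terminal, yet compare unequal, and only the escaped form shows why:

```python
import unicodedata


def escape_name(name: str) -> str:
    """ASCII-only rendering: braced hex for anything outside printable ASCII."""
    out = []
    for ch in name:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")          # escape the escape character itself
        elif 0x20 <= cp <= 0x7E:
            out.append(ch)
        else:
            out.append("\\x{%x}" % cp)
    return "".join(out)


precomposed = "f\u00e9"                                  # e-acute as one code point
decomposed = unicodedata.normalize("NFD", precomposed)   # "e" + COMBINING ACUTE ACCENT
plain = "Yep"
with_nul = "Ye\0p"                                       # embedded NUL, usually invisible

for a, b in [(precomposed, decomposed), (plain, with_nul)]:
    print(a == b, escape_name(a), escape_name(b))
# False f\x{e9} fe\x{301}
# False Yep Ye\x{0}p
```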

@p5pRT p5pRT commented Feb 3, 2012

The RT System itself - Status changed from 'new' to 'open'

@p5pRT p5pRT commented Feb 3, 2012

From @cpansprout

On Thu Feb 02 20:40:07 2012, Hugmeir wrote:

So I will probably go for the second, unless someone has objections and/
or a better idea.

2 sounds good.

--

Father Chrysostomos

@p5pRT p5pRT commented Feb 3, 2012

From @nwc10

On Thu, Feb 02, 2012 at 08:40:08PM -0800, Brian Fraser via RT wrote:

Second option
SV = IV(0x8d62b38) at 0x8d62b3c
REFCNT = 1
FLAGS = (TEMP,ROK)
RV = 0x8e278bc
SV = PVHV(0x8d1b39c) at 0x8e278bc
REFCNT = 2
FLAGS = (OOK,SHAREKEYS)
ARRAY = 0x8e2399c
KEYS = 0
FILL = 0
MAX = 7
RITER = -1
EITER = 0x0
NAME = "b\x{30cb}"
NAMECOUNT = 2
ENAME = "b\x{30cb}", "f\351"

You're intentionally using octal to distinguish things-as-bytes from hex for
things-as-UTF-8? Or is that just a side effect of the values chosen?

Because I'm thinking that some (documented, unambiguous) convention like that
would make for better reading than an explicit longhand character sequence
all the time.

Personally, I think the first option sucks -- it sort of starts alright
but breaks down easily for other types, like a coderef, which gets
printed like "STASH" :: "NAME", and would become "STASH" [UTF8
"STASH"] :: "NAME" [UTF8 "NAME"] if both of those were in UTF8. But
admittedly this is a stylistic concern more than anything.

Meanwhile, the third makes debugging dependent on having a font that
can display all symbols and on not getting anything invisible in your
names -- otherwise, good luck telling \xe9 from e\N{COMBINING ACUTE
ACCENT}, or working out why ref(bless {}, "Yep") ne ref(bless {}, "Ye\0p").

Yes, but in both cases the "style" is really about conveying information
accurately without clutter, so it's important.

So I will probably go for the second, unless someone has objections and/
or a better idea.

Yes, the second looks the best idea (so far).
We can change it if someone has a better idea. The dump format isn't
sacrosanct.

Nicholas Clark

@p5pRT p5pRT commented Feb 3, 2012

From @demerphq

On 3 February 2012 14:40, Nicholas Clark <nick@ccl4.org> wrote:

    ENAME = "b\x{30cb}", "f\351"

You're intentionally using octal to distinguish things-as-bytes from hex for
things-as-UTF-8? Or is that just a side effect of the values chosen?

As an aside, there are a number of bits of code that use octal for
codepoints <= 255, and hex for codepoints > 255.

I personally hate it; for some reason I don't think in octal anywhere
near as well as in hex, and I find it really confusing when the same line
has both. The code for emitting a quoted escaped string in perl
supports a few modes, so we could decide to use whatever we want.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT p5pRT commented Feb 4, 2012

From @demerphq

On 3 February 2012 17:48, demerphq <demerphq@gmail.com> wrote:

On 3 February 2012 14:40, Nicholas Clark <nick@ccl4.org> wrote:

    ENAME = "b\x{30cb}", "f\351"

You're intentionally using octal to distinguish things-as-bytes from hex for
things-as-UTF-8? Or is that just a side effect of the values chosen?

As an aside, there are a number of bits of code that use octal for
codepoints <= 255, and hex for codepoints > 255.

I personally hate it; for some reason I don't think in octal anywhere
near as well as in hex, and I find it really confusing when the same line
has both. The code for emitting a quoted escaped string in perl
supports a few modes, so we could decide to use whatever we want.

I should have added that the argument I have seen in at least one
place in the code (as a comment) is that octal is used for low byte
escapes because it is shorter. IOW, 100 nulls will be 200 chars long,
whereas with unbraced hex it would be 400, and with braces 500.
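
The arithmetic behind those figures can be checked with a few lines of Python (a sketch only; `escaped_lengths` is a name invented for this example, and the three formats are the shortest octal form, two-digit unbraced hex, and braced hex without padding):

```python
def escaped_lengths(raw: bytes) -> tuple:
    """Return the total escaped length under three escaping styles."""
    short_octal = "".join("\\%o" % b for b in raw)      # NUL becomes \0
    unbraced_hex = "".join("\\x%02x" % b for b in raw)  # NUL becomes \x00
    braced_hex = "".join("\\x{%x}" % b for b in raw)    # NUL becomes \x{0}
    return len(short_octal), len(unbraced_hex), len(braced_hex)


print(escaped_lengths(b"\0" * 100))   # (200, 400, 500)
```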

I personally think for stuff like this the rule should be, if there is
a named escape use it, if it is null use \0, otherwise use braced hex
if it is codepoints, and unbraced hex (2 digit) if it is bytes being
output.
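
Spelled out as code, that rule might look roughly like the Python sketch below. This is an illustration under assumptions, not what dump.c does: the set of "named" escapes, the handling of the backslash and double quote, and all function names are choices made for this example, and printable ASCII is passed through unchanged.

```python
# Assumed set of named escapes; the thread does not enumerate them.
NAMED_ESCAPES = {
    "\n": "\\n", "\r": "\\r", "\t": "\\t", "\f": "\\f",
    "\\": "\\\\", '"': '\\"',
}


def escape_unit(value: int, is_codepoint: bool) -> str:
    """Escape one byte or one code point according to the proposed rule."""
    ch = chr(value)
    if ch in NAMED_ESCAPES:
        return NAMED_ESCAPES[ch]        # 1. named escape if there is one
    if value == 0:
        return "\\0"                    # 2. NUL is written as \0
    if 0x20 <= value <= 0x7E:
        return ch                       # printable ASCII is left alone
    if is_codepoint:
        return "\\x{%x}" % value        # 3. braced hex for code points
    return "\\x%02x" % value            # 4. two-digit unbraced hex for bytes


def escape_codepoints(text: str) -> str:
    return "".join(escape_unit(ord(ch), True) for ch in text)


def escape_octets(raw: bytes) -> str:
    return "".join(escape_unit(b, False) for b in raw)


print(escape_codepoints("b\u30cb"))   # b\x{30cb}
print(escape_octets(b"f\xe9"))        # f\xe9
```

With this scheme the braces alone tell the reader whether a value is a code point or a raw byte, so no octal/hex mix is needed on one line.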

I also think that Dump output should be in ASCII unless requested to
do otherwise.

Also, I will note that the regex engine debug output does not use \ as
the escape character (and has not for a long time); it uses % so as to
make it absolutely clear whether we are talking about an escape from
dumping, or an escape in the pattern. So there is precedent for
having diagnostics be a little different from the normal rules of
perl.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT p5pRT commented Feb 4, 2012

From @nwc10

On Sat, Feb 04, 2012 at 11:11:51AM +0100, demerphq wrote:

On 3 February 2012 17:48, demerphq <demerphq@gmail.com> wrote:

On 3 February 2012 14:40, Nicholas Clark <nick@ccl4.org> wrote:

    ENAME = "b\x{30cb}", "f\351"

You're intentionally using octal to distinguish things-as-bytes from hex for
things-as-UTF-8? Or is that just a side effect of the values chosen?

As an aside, there are a number of bits of code that use octal for
codepoints <= 255, and hex for codepoints > 255.

I personally hate it; for some reason I don't think in octal anywhere
near as well as in hex, and I find it really confusing when the same line
has both. The code for emitting a quoted escaped string in perl
supports a few modes, so we could decide to use whatever we want.

Although thinking further I realise that these two *aren't* the same,
and a dump output should continue to show that:

$ ./perl -Ilib -MDevel::Peek -e '$a = "N" . chr 255; chop $a; Dump($a)'
SV = PV(0x100801070) at 0x100812ae8
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x100601b80 "N"\0
  CUR = 1
  LEN = 16
$ ./perl -Ilib -MDevel::Peek -e '$a = "N" . chr 256; chop $a; Dump($a)'
SV = PV(0x100801070) at 0x100812ae8
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x100601b80 "N"\0 [UTF8 "N"]
  CUR = 1
  LEN = 16

I should have added that the argument I have seen in at least one
place in the code (as a comment) is that octal is used for low byte
escapes because it is shorter. IOW, 100 nulls will be 200 chars long,
whereas with unbraced hex it would be 400, and with braces 500.

I personally think for stuff like this the rule should be, if there is
a named escape use it, if it is null use \0, otherwise use braced hex
if it is codepoints, and unbraced hex (2 digit) if it is bytes being
output.

Thinking about that, I like it. It also avoids any confusion between string
escapes and backslash escapes, and things like "\0123" (a.k.a. "\n3", not "S").

Although "\0" will need to be special-cased in some fashion if followed by
a digit, either as "\000" or "\x00". Possibly the latter.
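
The ambiguity is easy to reproduce, since Python string literals read octal escapes the same way here. The sketch below first shows the misparse and then one possible special case (the function `escape_nul_aware` is hypothetical, handles bytes only, and picks the "\x00" variant mentioned above):

```python
def escape_nul_aware(raw: bytes) -> str:
    """Write NUL as \\0, except before an ASCII digit, where \\x00 is used."""
    out = []
    for i, b in enumerate(raw):
        if b == 0:
            next_is_digit = i + 1 < len(raw) and 0x30 <= raw[i + 1] <= 0x39
            out.append("\\x00" if next_is_digit else "\\0")
        elif 0x20 <= b <= 0x7E and b != 0x5C:
            out.append(chr(b))          # printable ASCII other than backslash
        else:
            out.append("\\x%02x" % b)
    return "".join(out)


# "\0123" is read as octal 012 (newline) followed by "3", not as octal 123 ("S").
print("\0123" == "\n3")                      # True
print("\123" == "S")                         # True

# A NUL followed by the digits "123" must therefore not be written as \0123.
print(escape_nul_aware(b"\x00" + b"123"))    # \x00123
print(escape_nul_aware(b"a\x00b"))           # a\0b
```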

I also think that Dump output should be in ASCII unless requested to
do otherwise.

"printable" ASCII. (As you implied above)
Agree. Because really the lowest common denominator is all that can be relied
on.

Also, I will note that the regex engine debug output does not use \ as
the escape character (and has not for a long time); it uses % so as to
make it absolutely clear whether we are talking about an escape from
dumping, or an escape in the pattern. So there is precedent for
having diagnostics be a little different from the normal rules of
perl.

Which might mean a B or U prefix. (As Devel::Peek effectively has a \0
suffix)

As an aside, Devel::Peek's tests are probably the right place to test this
sort of stuff.

Nicholas Clark

@p5pRT p5pRT commented Feb 5, 2012

From @rjbs

* demerphq <demerphq@gmail.com> [2012-02-04T05:11:51]

I personally hate it; for some reason I don't think in octal anywhere
near as well as in hex, and I find it really confusing when the same line
has both. The code for emitting a quoted escaped string in perl
supports a few modes, so we could decide to use whatever we want.
[…]
I personally think for stuff like this the rule should be, if there is
a named escape use it, if it is null use \0, otherwise use braced hex
if it is codepoints, and unbraced hex (2 digit) if it is bytes being
output.

I have the same dislike for the current behavior, and your suggestion seems
like about what I'd like, too.

--
rjbs

@p5pRT p5pRT commented Dec 13, 2017

From zefram@fysh.org

This was fixed in commit 0eb335d in Perl 5.19.8.

-zefram

@p5pRT p5pRT commented Dec 13, 2017

@iabyn - Status changed from 'open' to 'resolved'
