Perl needs to normalize its identifiers #11573

Open
p5pRT opened this issue Aug 11, 2011 · 6 comments

Comments

p5pRT commented Aug 11, 2011

Migrated from rt.perl.org#96814 (status was 'open')

Searchable as RT96814$

p5pRT commented Aug 11, 2011

From tchrist@perl.com

Python runs its Unicode identifiers through NFD transforms, although Perl,
Ruby, and Java do not. That means a user has to know which form all his
idents are in, and which form his editor condescended to enter for him,
even though he cannot see which is which in his editor. This is prone to
bugs and errors, some of which will go long unnoticed.

*You* cannot tell which one got entered, and *you* cannot see which is
which, but Perl distinguishes otherwise identical things.
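
The underlying problem can be illustrated outside Perl. The following is a minimal sketch in Python, using the standard `unicodedata` module; the variable names are made up for illustration. It shows two spellings of the same word that render identically but compare unequal until both are normalized.

```python
import unicodedata

nfc = "\u00e9cran"    # é as one precomposed code point (NFC spelling)
nfd = "e\u0301cran"   # e followed by U+0301 COMBINING ACUTE ACCENT (NFD spelling)

# The two strings look the same on screen, yet differ as code point sequences.
same_raw = (nfc == nfd)

# Once both are folded to one normalization form, they compare equal.
same_normalized = (
    unicodedata.normalize("NFC", nfc) == unicodedata.normalize("NFC", nfd)
)

print(same_raw, same_normalized)
```

An identifier lookup that compares raw code points behaves like the first comparison; one that normalizes first behaves like the second.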

How can this possibly not be a bug?

I can figure out a tie map for hashes to make this work right, so that your
strings are autonormalized, but I cannot figure out how to do that sort of
magic to lookups in stashes, let alone in pads.

Since this is something each user must take special care to do "right"
every single time, or else he gets bugs, it is something that Perl should
be doing for him, based on the proven principle that nothing too important
to risk being forgotten should be *able* to be forgotten.

--tom

Summary of my perl5 (revision 5 version 14 subversion 0) configuration:
 
  Platform:
  osname=openbsd, osvers=4.4, archname=OpenBSD.i386-openbsd
  uname='openbsd chthon 4.4 generic#0 i386 '
  config_args='-des'
  hint=recommended, useposix=true, d_sigaction=define
  useithreads=undef, usemultiplicity=undef
  useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
  use64bitint=undef, use64bitall=undef, uselongdouble=undef
  usemymalloc=y, bincompat5005=undef
  Compiler:
  cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include',
  optimize='-O2',
  cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
  ccversion='', gccversion='3.3.5 (propolice)', gccosandvers='openbsd4.4'
  intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
  alignbytes=4, prototype=define
  Linker and Libraries:
  ld='cc', ldflags ='-Wl,-E -fstack-protector -L/usr/local/lib'
  libpth=/usr/local/lib /usr/lib
  libs=-lgdbm -lm -lutil -lc
  perllibs=-lm -lutil -lc
  libc=/usr/lib/libc.so.48.0, so=so, useshrplib=false, libperl=libperl.a
  gnulibc_version=''
  Dynamic Linking:
  dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
  cccdlflags='-DPIC -fPIC ', lddlflags='-shared -fPIC -L/usr/local/lib -fstack-protector'

Characteristics of this binary (from libperl):
  Compile-time options: MYMALLOC PERL_DONT_CREATE_GVSV PERL_MALLOC_WRAP
  PERL_PRESERVE_IVUV USE_LARGE_FILES USE_PERLIO
  USE_PERL_ATOF
  Built under openbsd
  Compiled at Jun 11 2011 11:48:28
  %ENV:
  PERL_UNICODE="SA"
  @INC:
  /usr/local/lib/perl5/site_perl/5.14.0/OpenBSD.i386-openbsd
  /usr/local/lib/perl5/site_perl/5.14.0
  /usr/local/lib/perl5/5.14.0/OpenBSD.i386-openbsd
  /usr/local/lib/perl5/5.14.0
  /usr/local/lib/perl5/site_perl/5.12.3
  /usr/local/lib/perl5/site_perl/5.11.3
  /usr/local/lib/perl5/site_perl/5.10.1
  /usr/local/lib/perl5/site_perl/5.10.0
  /usr/local/lib/perl5/site_perl/5.8.7
  /usr/local/lib/perl5/site_perl/5.8.0
  /usr/local/lib/perl5/site_perl/5.6.0
  /usr/local/lib/perl5/site_perl/5.005
  /usr/local/lib/perl5/site_perl
  .

p5pRT commented Aug 12, 2011

From @Hugmeir

On Thu, Aug 11, 2011 at 4:39 PM, tchrist1 <perlbug-followup@perl.org> wrote:

Python runs its Unicode identifiers through NFD transforms, although Perl,
Ruby, and Java do not.

Does Python use NFD? PEP 3131 recommends either NFC or NFKC, but I haven't
gotten too far into the accompanying discussion.

In any case, I agree that this needs to change, but I have doubts about how it
would be called from Perl-space. 'use normalization qw< NFD >;' implies that
all of the source is normalized, including string literals, so you'd
actually need to do something like 'use normalization identifiers =>
"NFD";' to avoid confusion... But that gives the impression that you can
also normalize other areas. And what about symbolic references, should those
be normalized too? Can you opt(in|out) of that? :)

I can figure out a tie map for hashes to make this work right, so that your
strings are autonormalized, but I cannot figure out how to do that sort of
magic to lookups in stashes, let alone in pads.

Tying stashes is broken, so that won't do for the moment. Without giving it
much thought, I imagine we could "simply" add checks in the core, or maybe
install store/fetch hooks for GVs/pads, if those aren't a hugely terrible
idea.

Unrelated to the bug report, what does Python do with bidi control
characters? The PEP thread has a couple of suggestions (
http://mail.python.org/pipermail/python-3000/2007-May/007750.html,
http://mail.python.org/pipermail/python-3000/2007-May/007823.html,
http://mail.python.org/pipermail/python-3000/2007-May/007826.html) but I
don't know what they ended up implementing.

p5pRT commented Aug 12, 2011

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented Aug 12, 2011

From tchrist@perl.com

"Brian Fraser via RT" <perlbug-followup@perl.org> wrote
  on Fri, 12 Aug 2011 00:26:34 PDT:

Python runs its Unicode identifiers through NFD transforms, although
Perl, Ruby, and Java do not.

Does Python use NFD? PEP 3131 recommends either NFC or NFKC, but I haven't
gotten too far into the accompanying discussion.

Sorry, you're right, it's NFC:

  #!/usr/bin/env python3.2
  # -*- coding: UTF-8 -*-
  écran = "NFD screen"
  écran = "NFC screen"
  print("First screen is", écran)
  print("Second screen is", écran)

prints out

  First screen is NFC screen
  Second screen is NFC screen
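
Because the two spellings in the script above cannot be told apart by eye, here is a sketch of the same experiment with the code points written as escapes. It assumes a Python 3 interpreter, which per PEP 3131 converts identifiers to NFKC while parsing (for this identifier NFC and NFKC agree); the `source` and `namespace` names are only for illustration.

```python
# Source text built with escapes so the two spellings are visible:
# the first assignment uses the NFD spelling, the second the NFC one.
source = "e\u0301cran = 'NFD screen'\n\u00e9cran = 'NFC screen'\n"

namespace = {}
exec(source, namespace)

# Collect the names the snippet bound, ignoring __builtins__.
names = [n for n in namespace if not n.startswith("__")]

# Only one name exists, stored in its composed form; the second
# assignment overwrote the value written by the first.
print(names == ["\u00e9cran"], namespace["\u00e9cran"])
```

Both assignments land on a single variable, which matches the "NFC screen" output shown twice above.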

I was worried about how this plays with Apple's HFS+, given
that it uses NFD. If you have a module named Écran, I get nervous
about how it gains a code point in length in the filesystem.
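
The length change is easy to see in a small sketch. This uses standard NFD from Python's `unicodedata` module, which is only an approximation of whatever decomposition HFS+ actually applies; the variable names are illustrative.

```python
import unicodedata

name = "\u00c9cran"                            # Écran with a precomposed É
decomposed = unicodedata.normalize("NFD", name)  # E + U+0301, then "cran"

# One extra code point, and one extra byte once encoded as UTF-8.
print(len(name), len(decomposed))
print(len(name.encode("utf-8")), len(decomposed.encode("utf-8")))
```

So a module name written one way in source may be stored under a longer name on disk, which is the mismatch being worried about here.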

In any case, I agree that this needs to change, but I have doubts about how it
would be called from Perl-space. 'use normalization qw< NFD >;' implies that
all of the source is normalized, including string literals, so you'd
actually need to do something like 'use normalization identifiers =>
"NFD";' to avoid confusion... But that gives the impression that you can
also normalize other areas. And what about symbolic references, should those
be normalized too? Can you opt(in|out) of that? :)

I agree that it has to be just for identifiers, not string literals,
because there are times you need to compare with something exactly.

  $nfd = "écran";
  $nfc = "écran";

Those need to be distinct.

I think the solution for hashes should probably be a tie layer
that normalizes its keys. That doesn't require any core changes.
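
As a rough analogue of such a tie layer, here is a sketch in Python rather than Perl. The class name `NormalizedKeyDict` is hypothetical, and the choice of NFC is an assumption; it is not an existing module, only an illustration of folding keys on the way in and on the way out.

```python
import unicodedata


class NormalizedKeyDict(dict):
    """Hypothetical mapping that folds every string key to NFC."""

    @staticmethod
    def _norm(key):
        # Non-string keys pass through untouched.
        return unicodedata.normalize("NFC", key) if isinstance(key, str) else key

    def __setitem__(self, key, value):
        super().__setitem__(self._norm(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._norm(key))

    def __contains__(self, key):
        return super().__contains__(self._norm(key))


screens = NormalizedKeyDict()
screens["e\u0301cran"] = "stored under the NFD spelling"

# Lookup with the NFC spelling finds the same entry.
print(screens["\u00e9cran"])
print(len(screens))
```

This sketch only intercepts item assignment, item lookup, and `in`; methods such as `get` and `update` would need the same treatment in a complete version.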

I can figure out a tie map for hashes to make this work right, so that your
strings are autonormalized, but I cannot figure out how to do that sort of
magic to lookups in stashes, let alone in pads.

Tying stashes is broken, so that won't do for the moment.

I was kinda just kidding, because I did remember this.

Without giving it much thought, I imagine we could "simply" add checks
in the core, or maybe install store/fetch hooks for GVs/pads, if those
aren't a hugely terrible idea.

Unrelated to the bug report, what does Python do with bidi control
characters? The PEP thread has a couple of suggestions (
http://mail.python.org/pipermail/python-3000/2007-May/007750.html,
http://mail.python.org/pipermail/python-3000/2007-May/007823.html,
http://mail.python.org/pipermail/python-3000/2007-May/007826.html) but I
don't know what they ended up implementing.

Haven't looked at that. Bidi is ugly, since Perl stuff goes left to
right, and an RTL string could flip around weak bidi mirrors so they
look different.

Interesting:

I'll repeat that UTR#39 explicitly discourages support
for formatting characters in identifiers.

And this one

  http://mail.python.org/pipermail/python-3000/2007-May/007725.html

points out that Java can get away with this because they have all these
default-ignorables they let by in source code. Yes, you can put nulls and
bells all over your Java source and the compiler will ignore them outside
literals. Scary.

This

  http://mail.python.org/pipermail/python-3000/2007-May/007833.html

seems as far as they got. I don't see any resolution. Too tired to
hack out stupid bidi tricks right now to test.

Hm, I wonder whether this has anything useful to say about the matter,
since they've had to think about it for URLs:

  http://www.w3.org/International/iri-edit/draft-duerst-iri-05.txt

--tom

p5pRT commented Aug 12, 2011

From @nwc10

On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

I was worried about how this plays with Apple's HFS+, given
that it uses NFD. If you have a module named Écran, I get nervous
about how it gains a code point in length in the filesystem.

Strictly it doesn't:

http://developer.apple.com/library/mac/technotes/tn/tn1150.html#UnicodeSubtleties

  IMPORTANT:

  An implementation must not use the Unicode utilities implemented
  by its native platform (for decomposition and comparison), unless
  those algorithms are equivalent to the HFS Plus algorithms defined
  here, and are guaranteed to be so forever. This is rarely the
  case. Platform algorithms tend to evolve with the Unicode
  standard. The HFS Plus algorithms cannot evolve because such
  evolution would invalidate existing HFS Plus volumes.

It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*.
And it's not allowed to change.

Which I think was an issue Father C raised - Unicode evolves, therefore
normalisation changes. Should Perl snapshot a particular normalisation and
keep that as canonical forever? Or should we run the (small) risk that
(dangerously written) scripts will change behaviour as a side effect of
running on a perl (newer or older) that doesn't use the same Unicode database?

This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that
there isn't a working Python solution to adopt.

This

  http://mail.python.org/pipermail/python-3000/2007-May/007833.html

seems as far as they got. I don't see any resolution. Too tired to
hack out stupid bidi tricks right now to test.

Shame.

Does any language have a working implementation of normalised Unicode
identifiers?

Nicholas Clark

p5pRT commented Aug 12, 2011

From tchrist@perl.com

Nicholas Clark <nick@ccl4.org> wrote
  on Fri, 12 Aug 2011 10:23:09 BST:

On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

I was worried about how this plays with Apple's HFS+, given
that it uses NFD. If you have a module named Écran, I get nervous
about how it gains a code point in length in the filesystem.

Strictly it doesn't:

...

It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*.
And it's not allowed to change.

I usually hedge that by saying that it's quasi-NFD. I don't know any
module that implements it, so it's really annoying to predict. I hate
the poke-it-and-see-what-shows-up approach, but maybe that's all one
can do.

Which I think was an issue Father C raised - Unicode evolves, therefore
normalisation changes. Should Perl snapshot a particular normalisation and
keep that as canonical forever? Or should we run the (small) risk that
(dangerously written) scripts will change behaviour as a side effect of
running on a perl (newer or older) that doesn't use the same Unicode database?

Is the fear that an unassigned code point would later get assigned something
that changes under normalization? If people are using unassigned code points,
then I suppose this may happen, but I can't see any other way. That's because
of Unicode's strong stability guarantee on normalization. The key point is
the last of the lines I quote below:

  http://unicode.org/policies/stability_policy.html

  Unlike many other standards, the Unicode Standard is continually
  expanding—new characters are added to meet a variety of uses, ranging from
  technical symbols to letters for archaic languages. Character properties
  are also expanded or revised to meet implementation requirements.

  In each new version of the Unicode Standard, the Unicode Consortium may add
  characters or make certain changes to characters that were encoded in a
  previous version of the standard. However, the Consortium imposes
  limitations on the types of changes that can be made, in an effort to
  minimize the impact on existing implementations.

  ...

  Normalization Stability

  Strong Normalization Stability
  Applicable Version: Unicode 4.1+

  If a string contains only characters from a given version of Unicode, and it
  is put into a normalized form in accordance with that version of Unicode,
  then the results will be identical to the results of putting that string
  into a normalized form in accordance with any subsequent version of Unicode.

  More formally, given versions V and U of Unicode, and any string S
  which only contains characters assigned according to both V and U, the
  following are always true:

  toNFC_V(S) = toNFC_U(S)
  toNFD_V(S) = toNFD_U(S)
  toNFKC_V(S) = toNFKC_U(S)
  toNFKD_V(S) = toNFKD_U(S)

  In particular, once a character is encoded, its canonical combining
  class and decomposition mapping will not be changed in any way.

Now, HFS+ came out in 1998, but the stability guarantee only applies to
Unicode version 4.1 and up, and 4.1 itself came out 2005-03-31.
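
The precondition in that guarantee, that the string contains only assigned characters, can be checked mechanically. Below is a sketch in Python; the helper name `all_assigned` is made up, and it treats general category `Cn` (which also covers noncharacters) as "not assigned" in the Unicode database bundled with the running interpreter.

```python
import unicodedata


def all_assigned(s):
    """Return True if no character in s has general category Cn."""
    return all(unicodedata.category(ch) != "Cn" for ch in s)


# The Unicode version this interpreter's database was built from.
print(unicodedata.unidata_version)

# Ordinary text qualifies for the stability guarantee.
print(all_assigned("e\u0301cran"))

# U+0378 is used here as an example of a code point with no assignment.
print(all_assigned("\u0378"))
```

A string that passes this check under one database should normalize the same way under any later one; a string that fails it is exactly the case where behaviour could shift between versions.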

This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that
there isn't a working Python solution to adopt.

I can't see that they've done anything about bidis.

Does any language have a working implementation of normalised
Unicode identifiers?

What exactly do you mean by this? As I said, Python runs them
through NFC. This may have ramifications on HFS+. Python
issue 11230 is about being able to import library modules
with non-ASCII names:

  http://bugs.python.org/issue11230

See in particular

  http://bugs.python.org/msg128724

which reads:

  Short answer:

  In Python 3.2, « import héhé » doesn't work on Windows, but you can have non-ASCII paths in sys.path.

  Longer answer:

  I fixed the import machinery to handle correctly non-ASCII characters
  in module *paths*. But the import machinery is unable to handle
  non-ASCII characters in module *names*: it fails if the filesystem
  encoding is not UTF-8 (eg. it fails on Windows). There is another
  exception: Python doesn't support (yet) non encodable module paths on
  Windows. On Windows, you can use any character in directory names, but
  Python 3.2 encodes paths to the filesystem encoding (ANSI code page)
  which is a smaller charset. In practical, this Windows specific
  limitation on module paths doesn't really matter.

  I plan to fix all these issues in Python 3.3: see #3080.

  --

  > Could you please make it clear in documentation and web pages,
  > that this feature is not working yet.

  What's New in Python 3.2 documentation has this sentence: "Python’s
  import mechanism can now load modules installed in directories with
  non-ASCII characters in the path name. This solved an aggravating
  problem with home directories for users with non-ASCII characters in
  their usernames." which is correct.

  Which web page should updated/fixed?

So I don't think they have it working in module names either. Besides
Perl, all of Python, Ruby, Java, and Go offer Unicode identifiers, with
various restrictions.

* Python does seem to do the IDS/IDC thing, so you might see idents
  with combining marks, but these are run through NFC so tend to go
  away for the common cases.

* Java I know to have filesystem issues, but Java also allows for
  random control characters in its identifiers, which it completely
  ignores and which do not become part of those names.

* In contrast Go does not seem to use IDS/IDC, because you get compiler
  errors if you have combining marks (NFD forms):

  % 6g idents.go
  idents.go:4: invalid identifier character 0x301
  idents.go:5: invalid identifier character 0x301

  % uniquote -x < idents.go
  package main
  func main() {
  var \x{E9}cran = "NFC screen"
  var e\x{301}cran = "NFD screen"
  println("tes \x{E9}crans sont ", \x{E9}cran, " and ", e\x{301}cran)
  }

  So it doesn't mind E9, but dislikes 301.

  (BTW, I keep making errors in Python because of there being no strict
  vars declaration that I can find the equivalent of, whereas with
  Go you don't have that problem.)

* I haven't poked at Ruby hard enough to know what it does here
  with external names. But internally, NFC and NFD forms are
  distinct instead of normalized:

  % ruby ident.ruby
  nfc
  nfd

  % uniquote -x < ident.ruby
  #!/usr/bin/env ruby
  #coding: utf-8
  ni\x{F1}o = "nfc"
  nin\x{303}o = "nfd"
  puts ni\x{F1}o
  puts nin\x{303}o
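
The Go behaviour in the list above is what one would expect from a rule that admits only letters and digits in identifiers, with no combining marks. Here is a sketch of such a rule in Python; the function name is invented, and the rule is a simplified reading of the behaviour observed above, not Go's actual lexer.

```python
import unicodedata


def letters_and_digits_only(name):
    """Accept names made of '_', Unicode letters (L*), and decimal digits (Nd)."""
    if not name:
        return False

    def is_letter(ch):
        return ch == "_" or unicodedata.category(ch).startswith("L")

    def is_digit(ch):
        return unicodedata.category(ch) == "Nd"

    # First character must be a letter; the rest may be letters or digits.
    return is_letter(name[0]) and all(
        is_letter(ch) or is_digit(ch) for ch in name[1:]
    )


print(letters_and_digits_only("\u00e9cran"))    # precomposed é is a letter
print(letters_and_digits_only("e\u0301cran"))   # U+0301 is a mark (Mn), not a letter
```

Under a rule like this the NFD spelling is rejected outright instead of being silently treated as a different name, which sidesteps the ambiguity at the cost of refusing scripts that need combining marks.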

--tom
