From tchrist@perl.com

Nicholas Clark <nick@ccl4.org> wrote
on Fri, 12 Aug 2011 10:23:09 BST:

On Fri, Aug 12, 2011 at 02:10:55AM -0600, Tom Christiansen wrote:

I was worried about how this plays with Apple's HFS+, given
that it uses NFD. If you have a module named Écran, I get nervous
about how it gains a code point in length in the filesystem.

Strictly it doesn't:

...

It's a snapshot of NFD - I think even a snapshot of a late NFD *draft*.
And it's not allowed to change.

I usually hedge that by saying that it's quasi-NFD. I don't know any
module that implements it, so it's really annoying to predict. I hate
the poke-it-and-see-what-shows-up approach, but maybe that's all one
can do.
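
For what it's worth, the poke-it-and-see approach can at least be scripted.
Here is a minimal sketch in Python, assuming the filesystem encoding can
represent É. The helper names reported_name and code_points are made up for
illustration. On a filesystem that decomposes names, the reported name should
come back one code point longer than the requested one; elsewhere it should
come back unchanged. Nothing here models what HFS+ actually does, it only
reports what the filesystem hands back.

```python
import os
import tempfile
import unicodedata


def reported_name(requested: str) -> str:
    """Create an empty file called `requested` in a scratch directory and
    return the name the filesystem reports back for it."""
    with tempfile.TemporaryDirectory() as scratch:
        with open(os.path.join(scratch, requested), "w"):
            pass
        (entry,) = os.listdir(scratch)  # exactly one file was created
        return entry


def code_points(text: str) -> str:
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Ask for the precomposed (NFC) spelling of the module file name.
requested = unicodedata.normalize("NFC", "\u00c9cran.pm")
reported = reported_name(requested)

print("requested:", len(requested), code_points(requested))
print("reported: ", len(reported), code_points(reported))
# Comparing after normalization sidesteps whichever form was stored.
print("same after NFC:", unicodedata.normalize("NFC", reported) == requested)
```

Comparing names only after normalizing both sides is the design choice that
matters: it gives the same answer whether the name was stored composed or
decomposed.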

Which I think was an issue Father C raised - Unicode evolves, therefore
normalisation changes. Should Perl snapshot a particular normalisation and
keep that as canonical forever? Or should we run the (small) risk that
(dangerously written) scripts will change behaviour as a side effect of
running on a perl (newer or older) that doesn't use the same Unicode database?

Is the fear that an unassigned code point would later get assigned something
that changes under normalization? If people are using unassigned code points,
then I suppose this may happen, but I can't see any other way. That's because
of Unicode's strong stability guarantee on normalization. The key point is
the last of the lines I quote below:

http://unicode.org/policies/stability_policy.html

Unlike many other standards, the Unicode Standard is continually
expanding—new characters are added to meet a variety of uses, ranging from
technical symbols to letters for archaic languages. Character properties
are also expanded or revised to meet implementation requirements.

In each new version of the Unicode Standard, the Unicode Consortium may add
characters or make certain changes to characters that were encoded in a
previous version of the standard. However, the Consortium imposes
limitations on the types of changes that can be made, in an effort to
minimize the impact on existing implementations.

...

Normalization Stability

Strong Normalization Stability
Applicable Version: Unicode 4.1+

If a string contains only characters from a given version of Unicode, and it
is put into a normalized form in accordance with that version of Unicode,
then the results will be identical to the results of putting that string
into a normalized form in accordance with any subsequent version of Unicode.

More formally, given versions V and U of Unicode, and any string S
which only contains characters assigned according to both V and U, the
following are always true:

toNFC_V(S) = toNFC_U(S)
toNFD_V(S) = toNFD_U(S)
toNFKC_V(S) = toNFKC_U(S)
toNFKD_V(S) = toNFKD_U(S)

In particular, once a character is encoded, its canonical combining
class and decomposition mapping will not be changed in any way.

Now, HFS+ came out in 1998, but the stability guarantee only applies to
Unicode version 4.1 and up, and 4.1 itself came out on 2005-03-31.
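
One way to get a feel for the cross-version claim is to compare two Unicode
databases side by side. Python's unicodedata module ships a frozen 3.2.0
database as unicodedata.ucd_3_2_0 alongside its current one, so the sketch
below normalizes the same string under both. The sample string is my own
choice, and all of its characters were already assigned in 3.2. Note that 3.2
predates the 4.1 guarantee, so agreement here is an observation about these
particular characters, not something the policy promises.

```python
import unicodedata

old = unicodedata.ucd_3_2_0   # frozen Unicode 3.2.0 database
new = unicodedata             # whatever version this Python ships

# Hypothetical sample: only characters assigned in both versions.
sample = "\u00c9cran ni\u00f1o"

print("old database:", old.unidata_version)
print("new database:", new.unidata_version)

results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    # Normalize under each database and compare the results.
    results[form] = old.normalize(form, sample) == new.normalize(form, sample)
    print(form, "identical:", results[form])
```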

This doesn't seem to be addressed at all in PEP 3131, so I'm assuming that
there isn't a working Python solution to adopt.

I can't see that they've done anything about bidis.

Does any language have a working implementation of normalised
Unicode identifiers?

What exactly do you mean by this? As I said, Python runs them
through NFC. This may have ramifications on HFS+. Python
issue 11230 is about being able to import library modules
with non-ASCII names:

http://bugs.python.org/issue11230

And in particular

http://bugs.python.org/msg128724

which reads:

Short answer:

In Python 3.2, « import héhé » doesn't work on Windows, but you can have non-ASCII paths in sys.path.

Longer answer:

I fixed the import machinery to handle correctly non-ASCII characters
in module *paths*. But the import machinery is unable to handle
non-ASCII characters in module *names*: it fails if the filesystem
encoding is not UTF-8 (eg. it fails on Windows). There is another
exception: Python doesn't support (yet) non encodable module paths on
Windows. On Windows, you can use any character in directory names, but
Python 3.2 encodes paths to the filesystem encoding (ANSI code page)
which is a smaller charset. In practical, this Windows specific
limitation on module paths doesn't really matter.

I plan to fix all these issues in Python 3.3: see #3080.

> Could you please make it clear in documentation and web pages,
> that this feature is not working yet.

What's New in Python 3.2 documentation has this sentence: "Python’s
import mechanism can now load modules installed in directories with
non-ASCII characters in the path name. This solved an aggravating
problem with home directories for users with non-ASCII characters in
their usernames." which is correct.

Which web page should updated/fixed?

So I don't think they have it working in module names either. Besides
Perl, all of Python, Ruby, Java, and Go offer Unicode identifiers, with
various restrictions.

* Python does seem to do the IDS/IDC thing, so you might see idents
with combining marks, but these are run through NFC so tend to go
away for the common cases.

* Java I know to have filesystem issues, but Java also allows for
random control characters in its identifiers, which it completely
ignores and which do not become part of those names.

* In contrast Go does not seem to use IDS/IDC, because you get compiler
errors if you have combining marks (NFD forms):

% 6g idents.go
idents.go:4: invalid identifier character 0x301
idents.go:5: invalid identifier character 0x301

% uniquote -x < idents.go
package main
func main() {
var \x{E9}cran = "NFC screen"
var e\x{301}cran = "NFD screen"
println("tes \x{E9}crans sont ", \x{E9}cran, " and ", e\x{301}cran)
}

So it doesn't mind E9, but dislikes 301.

(BTW, I keep making errors in Python because of there being no strict
vars declaration that I can find the equivalent of, whereas with
Go you don't have that problem.)

* I haven't poked at Ruby hard enough to know what it does here
with external names. But internally, NFC and NFD forms are
distinct instead of normalized:

% ruby ident.ruby
nfc
nfd

% uniquote -x < ident.ruby
#!/usr/bin/env ruby
#coding: utf-8
ni\x{F1}o = "nfc"
nin\x{303}o = "nfd"
puts ni\x{F1}o
puts nin\x{303}o
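
For comparison with the Go and Ruby runs, the Python behaviour can be checked
with a short sketch. It assigns to a variable spelled in NFD and then looks
in the namespace to see which spelling was actually stored. PEP 3131
specifies NFKC rather than NFC for identifiers; for é the two forms agree, so
this example cannot tell them apart. The variable names are my own.

```python
import unicodedata

nfc_name = "\u00e9cran"                            # precomposed é
nfd_name = unicodedata.normalize("NFD", nfc_name)  # e + U+0301

# Compile and run an assignment whose target is spelled in NFD.
namespace = {}
exec(f'{nfd_name} = "assigned via the NFD spelling"', namespace)

# The parser normalizes identifiers, so only one spelling is a key.
print("NFC spelling present:", nfc_name in namespace)
print("NFD spelling present:", nfd_name in namespace)
```

If the identifier is normalized as described, the composed spelling is the
one that ends up as the key, which is the opposite of what an NFD filesystem
would store for a module file of the same name.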

--tom

Perl needs to normalize its identifiers #11573
