Perl's open() has broken Unicode file name support #15883

Open
p5pRT opened this issue Feb 21, 2017 · 23 comments


@p5pRT p5pRT commented Feb 21, 2017

Migrated from rt.perl.org#130831 (status was 'open')

Searchable as RT130831$


@p5pRT p5pRT commented Feb 21, 2017

From @pali

Function open() has broken processing of non-ASCII file names.

Look at these two examples:

$ perl -e 'open my $file, ">", "\N{U+FF}"'

$ perl -e 'open my $file, ">", "\xFF"'

The first one creates a file named 0xc3 0xbf (ÿ), the second one a file
named 0xff.

Because the two strings "\N{U+FF}" and "\xFF" are equal, they must
create the same file, not two different ones.

$ perl -e '"\xFF" eq "\N{U+FF}" && print "equal\n"'
equal

The bug is in the open() implementation, in PP(pp_open) in pp_sys.c.

The file name is read from the Perl scalar into a C char* as:

tmps = SvPV_const(sv, len);

But after that, SvUTF8(sv) is *not* used to check whether char* tmps is
encoded in UTF-8 or Latin-1. tmps is passed directly to do_open6()
without the SvUTF8 information.
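
The effect of ignoring the UTF8 flag can be observed from Perl itself. The following is a minimal sketch, not code from pp_sys.c: the helper name `internal_bytes_hex` is made up for illustration, and it only approximates what SvPV_const() would return by consulting the flag via `utf8::is_utf8`.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of the bytes in the scalar's internal buffer,
# which is roughly what SvPV_const() hands to C code that ignores SvUTF8.
sub internal_bytes_hex {
    my ($str) = @_;
    my $buf = $str;
    # For an upgraded (UTF8-flagged) scalar the buffer holds UTF-8;
    # utf8::encode exposes those same bytes as a plain byte string.
    utf8::encode($buf) if utf8::is_utf8($buf);
    return join ' ', map { sprintf '%02x', ord } split //, $buf;
}

my $downgraded = "\xFF";
utf8::downgrade($downgraded);    # stored internally as one byte
my $upgraded = "\xFF";
utf8::upgrade($upgraded);        # same character, stored internally as UTF-8

print "equal\n" if $downgraded eq $upgraded;
print internal_bytes_hex($downgraded), "\n";    # ff
print internal_bytes_hex($upgraded),   "\n";    # c3 bf
```

The two scalars compare equal, yet the buffers differ, which matches the two different file names created by the one-liners above.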

So to fix this bug, we need to define how open() should process the
file name: either as binary octets, in which case SvPVbyte() should be
used instead of SvPV(), or as a Unicode string, in which case SvPVutf8()
should be used instead of SvPV().

It also means we need to define what Perl_do_open6() should expect. Its
file name argument is of type const char *oname. It should be either
binary octets or UTF-8.
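
At the Perl level, the two choices correspond roughly to downgrading the file name or UTF-8 encoding it before it reaches the system call. A minimal sketch, assuming two hypothetical helpers `filename_as_octets` and `filename_as_utf8` (neither exists in the Perl core):

```perl
use strict;
use warnings;

# Analogue of SvPVbyte(): treat the name as octets; fail on characters > 0xFF.
sub filename_as_octets {
    my ($name) = @_;
    my $octets = $name;
    utf8::downgrade($octets, 1)
        or die "Wide character in file name\n";
    return $octets;
}

# Analogue of SvPVutf8(): treat the name as characters and encode them as UTF-8.
sub filename_as_utf8 {
    my ($name) = @_;
    my $octets = $name;
    utf8::encode($octets);
    return $octets;
}
```

With either helper, "\xFF" and "\N{U+FF}" give the same result regardless of how the scalar is stored internally: the first helper yields the single byte ff, the second yields c3 bf.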

There are basically two problems with it:

1) On some systems (e.g. Linux) a file name can be an arbitrary sequence
of bytes. It does not have to be valid UTF-8.

2) Perl modules probably already pass Perl Unicode scalars as file
names.

Any decision should still allow opening any file on the VFS, as in 1),
and probably should not break 2). I'm not sure it is possible to have
both 1) and 2) together.

The current state is worse, as both 1) and 2) are broken.


@p5pRT p5pRT commented Feb 26, 2017

From @jkeenan

ISTR seeing a fair amount of discussion of this issue on #p5p. Would anyone care to summarize this discussion?

Thank you very much.

--
James E Keenan (jkeenan@​cpan.org)


@p5pRT p5pRT commented Feb 26, 2017

The RT System itself - Status changed from 'new' to 'open'


@p5pRT p5pRT commented Feb 27, 2017

From @pali

Some more information:

Windows has two sets of functions for accessing files. The first, with
the -A suffix, takes file names in the encoding of the current 8-bit
codepage. The second, with the -W suffix, takes file names in Unicode
(more precisely, in the Windows variant of UTF-16). With the -A
functions it is only possible to access files whose names contain only
characters available in the current 8-bit codepage. Internally, all file
names are stored in Unicode, so the -W functions must be used to have
access to any file name. Therefore, on Windows we need Unicode file
names in Perl's open() function to have access to any file stored on
disk.

Linux stores file names as binary octets; there is no encoding or
requirement for Unicode. Therefore, to access any file on Linux, Perl's
open() function should take a downgraded/non-Unicode file name.

Which means there is no way to have uniform, multiplatform support for
file access without hacks.

I'm thinking that for Linux we could specify some (hint) variable which
would contain the encoding name (it could be hidden in some pragma
module...). Perl's open() function could then take a Unicode file name
and convert it to the encoding specified by that variable. The default
value for that variable could be taken from the locale, or default to
UTF-8 (which is probably the most used and a sane default value).

This would allow us to have a uniform open() function which takes a
Unicode file name on (probably) any platform. I think this is the only
sane approach if Perl wants to support Unicode file names.
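
The proposal could be prototyped in pure Perl along these lines. This is only a sketch under stated assumptions: `$FILENAME_ENCODING`, `encode_filename` and `open_unicode` are hypothetical names invented here, not existing Perl variables or functions, and the default is hard-coded to UTF-8 rather than read from the locale.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical global setting, as proposed above; not an existing Perl variable.
our $FILENAME_ENCODING = 'UTF-8';

# Convert a Perl character string to the octets that are passed to the OS.
# Dies if a character cannot be represented in the configured encoding.
sub encode_filename {
    my ($name) = @_;
    my $copy = $name;    # encode() with FB_CROAK may modify its source
    return Encode::encode($FILENAME_ENCODING, $copy, Encode::FB_CROAK());
}

# Hypothetical wrapper: open a file given a mode and a character-string name.
sub open_unicode {
    my ($mode, $name) = @_;
    open(my $fh, $mode, encode_filename($name))
        or die "Cannot open file: $!\n";
    return $fh;
}
```

Because the result of `encode_filename` is always a byte string, the file name no longer depends on whether the input scalar happened to be upgraded.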

But the problem is how Perl's open() function is currently implemented.
Does it expect bytes or a Unicode string?


@p5pRT p5pRT commented Feb 27, 2017

From zefram@fysh.org

pali@​cpan.org wrote​:

Which means there is no way to have uniform, multiplatform support for
file access without hacks.

Depends what you're trying to do "uniformly". If you want to be able
to open any file, then each platform has an obvious way of representing
any filename as a Perl string (as a full Unicode string on Windows and
as an octet string on Unix), so using Perl strings for filenames could
be a uniform interface. The format of filename strings does vary between
platforms, but we already have such variation in the directory separators,
and we have File::Spec to provide a uniform interface to it.

The thing that can't be done uniformly is to generate a filename from an
arbitrary Unicode string in accordance with the platform's conventions.
We could of course add a File::Spec method that attempts to do this, but
there's a fundamental problem that Unix doesn't actually have a consistent
convention for it. But this isn't really a big problem. We don't need
to use arbitrary Unicode strings, that weren't intended to be filenames,
as filenames. It's something to avoid​: a lot of security problems have
arisen from programs that tried to use arbitrary data strings in this way.

The strings that we should be using as filenames are strings that are
explicitly specified by the user as filenames. The user, at runtime,
can be expected to be aware of platform conventions and to supply
locally-appropriate filenames.

I'm thinking that for Linux we could specify some (hint) variable which
will contains encoding name (it can be hidden in some pragma module...).

Ugh. If the `hint' is lexically scoped, this loses as soon as a
filename crosses a module boundary. If global, that would be saner;
it's effectively part of the interface to the OS. But you then have
a backcompat issue that you have to handle encoding failures in code
paths that currently never generate exceptions. There's also a terrible
problem with OS interfaces that return filenames (readdir(3), readlink(2),
et al)​: you have to *decode* the filename, and if it doesn't decode then
you've lost the ability to work with arbitrary existing files.

As default value for that variable (for encoding) can be used from
locale or defaults to UTF-8 (which is probably most used and sane
default value).

These are both crap as defaults. The locale's nominal encoding is quite
likely to be ASCII, and both ASCII and UTF-8 are incapable of generating
certain octet strings as output. Thus if filenames are subjected to
either of these encodings then it is impossible for the user to specify
some filenames that are valid at the syscall interface, and if such a
filename actually exists then you run into the above-mentioned decoding
problem. For example, the one-octet string "\xc0" doesn't decode as
either ASCII or UTF-8. The only sane default, if you want to offer this
encoding system, is Latin-1, which behaves as a null encoding on Perl
octet strings.
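
The "\xc0" example can be checked with the Encode module. The sketch below uses a hypothetical helper `round_trips` (a name invented here) that reports whether an octet string survives a strict decode followed by a strict encode under a given encoding:

```perl
use strict;
use warnings;
use Encode ();

# True if $octets survives decode-then-encode unchanged under $encoding.
sub round_trips {
    my ($octets, $encoding) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $chars = eval { Encode::decode($encoding, $octets, $check) };
    return 0 unless defined $chars;    # the octets do not decode at all
    my $back = eval { Encode::encode($encoding, $chars, $check) };
    return (defined $back && $back eq $octets) ? 1 : 0;
}

for my $encoding (qw(ascii UTF-8 iso-8859-1)) {
    printf "%-10s %s\n", $encoding,
        round_trips("\xC0", $encoding) ? 'round-trips' : 'fails';
}
```

Only iso-8859-1 (Latin-1) round-trips here, since it maps each of the 256 octet values to exactly one character and back.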

The trouble here really arises because the scheme effectively uses the
encoding in reverse. Normally we use a character encoding to encode a
character string as an octet string, so that we can store those octets
and later read them to recover the original character string. With Unix
filenames, however, the thing that we want to represent and store, which
is the filename as it appears at the OS interface, is an octet string.
The encoding layer, if there is one, is concerned with representing
that octet string as a character string. An encoding that can't handle
all octet strings is a problem, just as in normal circumstances a
character encoding that can't handle all character strings is a problem.
Most character encodings are just not designed to be used in reverse,
and don't have a design goal of encoding to all octet strings or of
decode-then-encode round-tripping.

But problem is how currently Perl's open() function is implemented. It
expects bytes or Unicode string?

The current behaviour is broken on any platform. To get to anything sane
we will need a change that breaks some backcompat. In that situation
we are not constrained by the present arrangement of the open() internals.

-zefram


@p5pRT p5pRT commented Feb 27, 2017

From @Leont

On Mon, Feb 27, 2017 at 10​:21 PM, <pali@​cpan.org> wrote​:

Windows has two sets of functions for accessing files. First with -A
suffix which takes file names in encoding of current 8bit codepage.
Second with -W suffix which takes file names in Unicode (more precisely
in Windows variant of UTF-16). With -A functions it is possible to
access only those files which file names contains only characters
available in current 8bit codepage. Internally are all file names stored
in Unicode. So -W functions must be used to have access to any file
name. And therefore for Windows we need Unicode file name in perl open()
function to have access to any file stored on disk.

Linux stores file names in binary octets, there is no encoding or
requirement for Unicode. Therefore to access any file on Linux, Perl's
open() function should takes downgraded/non-Unicode file name.

Which means there is no way to have uniform, multiplatform support for
file access without hacks.

Correct observations. Except OS X makes this more complicated still​:
it uses UTF-8 encoded bytes, normalized using a non-standard variation
of NFD.
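
To illustrate why normalization matters for file names, here is a small sketch using the core Unicode::Normalize module. It uses standard NFD, which is only an approximation of the non-standard variant mentioned above, and `utf8_hex` is a helper name made up for this example.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: UTF-8 encode a character string and hex-dump the bytes.
sub utf8_hex {
    my ($chars) = @_;
    return join ' ', map { sprintf '%02x', ord }
        split //, Encode::encode('UTF-8', $chars);
}

my $name = "\N{U+E9}";    # LATIN SMALL LETTER E WITH ACUTE
print utf8_hex(NFC($name)), "\n";    # c3 a9    (precomposed)
print utf8_hex(NFD($name)), "\n";    # 65 cc 81 (e + combining acute)
```

The same visible name therefore has two different byte representations, so a name written in one form may be read back from a normalizing file system in the other.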

I'm thinking that for Linux we could specify some (hint) variable which
will contains encoding name (it can be hidden in some pragma module...).
And then Perl's open() function can takes Unicode file name and can
convert it to encoding (specified by that variable). As default value
for that variable (for encoding) can be used from locale or defaults to
UTF-8 (which is probably most used and sane default value).

This would allow us to have uniform open() function with takes Unicode
file name on (probably) any platform. I think this is the only sane
approach if Perl want to support Unicode file names.

I would welcome a 'unicode_filenames' feature. I don't think any value
other than binary is sane on Linux though. I think we learned from
perl 5.8.0.

But problem is how currently Perl's open() function is implemented. It
expects bytes or Unicode string?

Both. Neither. Welcome to The Unicode Bug.

Leon


@p5pRT p5pRT commented Mar 1, 2017

From @tonycoz

On Tue, 21 Feb 2017 12​:58​:03 -0800, pali@​cpan.org wrote​:

So to fixing this bug it is needed to define how function open should
process filename. Either as binary octets and SvPVbyte() instead of
SvPV() should be used, or as Unicode string and SvPVutf8() instead of
SvPV() should be used.

It also means that it is needed to define what Perl_do_open6() should
expect. Its argument for file name is of type​: const char *oname. It
should be either binary octets or UTF-8.

This sounds like something that could be prototyped on CPAN by replacing CORE::GLOBAL::open, CORE::GLOBAL::readdir etc.

Tony


@p5pRT p5pRT commented Mar 1, 2017

From @pali

On Monday 27 February 2017 15​:27​:32 Zefram via RT wrote​:

I'm thinking that for Linux we could specify some (hint) variable which
will contains encoding name (it can be hidden in some pragma module...).

Ugh. If the `hint' is lexically scoped, this loses as soon as a
filename crosses a module boundary. If global, that would be saner;

Yes, global. Ideally something which can be set when starting perl (e.g.
a perl parameter) or via an env variable.

it's effectively part of the interface to the OS.

Yes. And for this reason modules should not normally change the value
of that variable.

But you then have
a backcompat issue that you have to handle encoding failures in code
paths that currently never generate exceptions. There's also a terrible
problem with OS interfaces that return filenames (readdir(3), readlink(2),
et al)​: you have to *decode* the filename, and if it doesn't decode then
you've lost the ability to work with arbitrary existing files.

We can use the Encode::encode() function in non-croak mode, which
replaces invalid characters with some replacement and throws a warning
about it.

This could be the default behaviour, so that all those OS-related
functions do not die. Maybe there could be some switch (feature?) which
changes the mode of the encode function to die. New code could then
handle and deal with it.

As default value for that variable (for encoding) can be used from
locale or defaults to UTF-8 (which is probably most used and sane
default value).

These are both crap as defaults. The locale's nominal encoding is quite
likely to be ASCII, and both ASCII and UTF-8 are incapable of generating
certain octet strings as output.

It is not crap as a default. Currently the locale encoding is what is
used for such actions. It is used for converting multibyte characters
into octets and vice versa in other applications.

So if your locale encoding is set to ASCII then more applications are
unable to print non-ASCII characters on your terminal.

But there are many mappings from Unicode to bytes; several are "correct"
in some cases and several are in use, so there is no single one which
should be used. Whichever you choose, you still get problems.

Therefore the locale encoding is what we can use, as it is the only
information we have from the operating system here.

Thus if filenames are subjected to
either of these encodings then it is impossible for the user to specify
some filenames that are valid at the syscall interface, and if such a
filename actually exists then you run into the above-mentioned decoding
problem. For example, the one-octet string "\xc0" doesn't decode as
either ASCII or UTF-8. The only sane default, if you want to offer this
encoding system, is Latin-1, which behaves as a null encoding on Perl
octet strings.

Latin-1 is not sane as it is unable to handle Unicode strings with
characters above U+0000FF. It is as wrong as ASCII or UTF-8.

The trouble here really arises because the scheme effectively uses the
encoding in reverse. Normally we use a character encoding to encode a
character string as an octet string, so that we can store those octets
and later read them to recover the original character string. With Unix
filenames, however, the thing that we want to represent and store, which
is the filename as it appears at the OS interface, is an octet string.
The encoding layer, if there is one, is concerned with representing
that octet string as a character string. An encoding that can't handle
all octet strings is a problem, just as in normal circumstances a
character encoding that can't handle all character strings is a problem.
Most character encodings are just not designed to be used in reverse,
and don't have a design goal of encoding to all octet strings or of
decode-then-encode round-tripping.

If we want to handle any Unicode string created in Perl and passed to
Perl's open() function, we need to use some Unicode transformation
function.

If we want to open an arbitrary file stored on disk (in bytes), then we
need an encoding which maps the whole space of byte sequences to Unicode
strings.

Both cannot be achieved together. And even if there were such a
function, it would still not be useful, as file names on disk are
already stored in some encoding. The kernel just does not care about it
and does not even know that encoding.

So the user or application (or library or system) must know in which
encoding file names are stored. And this should be given by the current
locale.

Therefore I suggest using the default encoding from the locale, with the
ability to change it. If the user has stored files in a different
encoding than the one specified in the locale, then the user already has
a problem handling such files in applications which use wchar_t, and
probably already knows how to deal with it...

Either temporarily change the locale encoding, or pass some argument to
perl (or an env variable or perl variable) to specify the correct one.

But problem is how currently Perl's open() function is implemented. It
expects bytes or Unicode string?

The current behaviour is broken on any platform. To get to anything sane
we will need a change that breaks some backcompat. In that situation
we are not constrained by the present arrangement of the open() internals.

We can define a new feature, 'unicode_filenames' or something like that,
and then Perl's open() function can be "fixed".


@p5pRT p5pRT commented Mar 1, 2017

From @pali

On Tuesday 28 February 2017 00​:35​:45 Leon Timmermans wrote​:

On Mon, Feb 27, 2017 at 10​:21 PM, <pali@​cpan.org> wrote​:

Windows has two sets of functions for accessing files. First with -A
suffix which takes file names in encoding of current 8bit codepage.
Second with -W suffix which takes file names in Unicode (more precisely
in Windows variant of UTF-16). With -A functions it is possible to
access only those files which file names contains only characters
available in current 8bit codepage. Internally are all file names stored
in Unicode. So -W functions must be used to have access to any file
name. And therefore for Windows we need Unicode file name in perl open()
function to have access to any file stored on disk.

Linux stores file names in binary octets, there is no encoding or
requirement for Unicode. Therefore to access any file on Linux, Perl's
open() function should takes downgraded/non-Unicode file name.

Which means there is no way to have uniform, multiplatform support for
file access without hacks.

Correct observations. Except OS X makes this more complicated still​:
it uses UTF-8 encoded bytes, normalized using a non-standard variation
of NFD.

It is not a problem or a complicated issue. It just means that OS X also
uses a Unicode API, same as Windows. It just uses a different
representation of Unicode, say an OS X variant of UTF-8. We have no
problem generating the OS X representation from a Perl string and vice
versa. It just needs platform-specific code, same as Windows does for
its variant of UTF-16.

I'm thinking that for Linux we could specify some (hint) variable which
will contains encoding name (it can be hidden in some pragma module...).
And then Perl's open() function can takes Unicode file name and can
convert it to encoding (specified by that variable). As default value
for that variable (for encoding) can be used from locale or defaults to
UTF-8 (which is probably most used and sane default value).

This would allow us to have uniform open() function with takes Unicode
file name on (probably) any platform. I think this is the only sane
approach if Perl want to support Unicode file names.

I would welcome a 'unicode_filenames' feature. I don't think any value
other than binary is sane on Linux though. I think we learned from
perl 5.8.0.

But problem is how currently Perl's open() function is implemented. It
expects bytes or Unicode string?

Both. Neither. Welcome to The Unicode Bug.

So it is time for a unicode_filenames feature to fix that bug.


@p5pRT p5pRT commented Mar 2, 2017

From zefram@fysh.org

pali@​cpan.org wrote​:

We can use Encode​::encode() function in non-croak mode which replace
invalid characters by some replacement

No, that fucks up the filenames. After such a substituting decode,
re-encoding the result will produce some octet string different from
the original. So if you read a filename from a directory, attempting to
use that filename to address the file will at best fail because it's a
non-existent name. (If you're unlucky then it'll address a *different*
file.)
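
The loss described here is easy to reproduce with Encode's default (substituting) mode. A minimal sketch, where `lossy_round_trip` is a name invented for this example:

```perl
use strict;
use warnings;
use Encode ();

# Decode octets as UTF-8 with substitution, then encode the result again.
sub lossy_round_trip {
    my ($octets) = @_;
    # Without a CHECK argument, invalid input is replaced by U+FFFD.
    my $chars = Encode::decode('UTF-8', $octets);
    return Encode::encode('UTF-8', $chars);
}

my $on_disk = "\xFF.txt";    # a legal Unix file name, but not valid UTF-8
my $after   = lossy_round_trip($on_disk);
print $after eq $on_disk ? "same name\n" : "different name\n";
```

The re-encoded name contains the UTF-8 form of U+FFFD instead of the original 0xff octet, so it no longer addresses the file that readdir returned.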

So if your locale encoding is set to ASCII then more applications are
unable to print on your terminal non-ascii characters.

I don't follow your argument here. You don't seem to be addressing the
crapness of making it impossible to deal with arbitrary filenames at
the syscall interface.

Latin-1 is not sane as it is unable to handle Unicode strings with
characters above U+0000FF. It wrong as ASCII or UTF-8.

My objective isn't to make every Unicode string represent a filename.
My objective is to have every filename represented by some Perl string.
Latin-1 would be a poor choice in situations where it is desired to
represent arbitrary Unicode strings, but it's an excellent choice
for the job of representing filenames. Different jobs have different
requirements, leading to different design choices.

So user or application (or library or system) must know in which
encoding are stored file names. And this should be present in current
locale.

Impossible. The locale model of character encoding (as you treat it
here) is fundamentally broken. The model is that every string in the
universe (every file content, filename, command line argument, etc.) is
encoded in the same way, and the locale environment variable tells you
which universe you're in. But in the real universe, files, filenames,
and so on turn up encoded how their authors liked to encode them, and
that's not always the same. In the real universe we have to cope with
data that is not encoded in our preferred way.

The locale encoding is OK if one treats it strictly as a user
*preference*. What one can do with such a preference without risking
running into uncooperative data is quite limited.

So if user has stored files in different encoding as specified in
locale, then user has already problem to handle such files

I run in the C locale, which on this system has nominally ASCII encoding
(which is in fact my preferred encoding), and yet I occasionally
run into filenames that are derived from UTF-8 or Latin-1 encoding.
Do you realise how much difficulty I have in dealing with such files?
None at all. For my shell is 8-bit clean, and every program I use just
passes the octet string straight through (e.g., from argv to syscalls).
This is a healthy system.

The only programs I've encountered that have any difficulty with
non-ASCII filenames are two language implementations (Rakudo Perl 6
and GNU Guile 2.0) that I don't use for real work. Both of them have
decided, independently, that filenames must be encodings of arbitrary
Unicode strings. Interestingly, they've reached different conclusions
about what encoding is used​: Guile considers it to be the locale's
nominal encoding, whereas Rakudo reckons it's UTF-8 regardless of locale.
(Rakudo is making an attempt to augment its concept of Unicode strings to
be able to represent arbitrary Unicode strings in a way compatible with
UTF-8, but that's not fully working yet, and I'm not convinced that it can
ever work satisfactorily.) Don't make the same mistake as these projects.

We can define new use feature 'unicode_filenames' or something like that
and then Perl's open() function can be "fixed".

That would be a lexically-scoped effect, which (as mentioned earlier)
loses as soon as a filename crosses a module boundary.

-zefram


@p5pRT p5pRT commented Mar 2, 2017

From zefram@fysh.org

I wrote​:

(Rakudo is making an attempt to augment its concept of Unicode strings to
be able to represent arbitrary Unicode strings in a way compatible with
UTF-8,

Oops, I meant "arbitrary octet strings" there.

-zefram


@p5pRT p5pRT commented Mar 4, 2017

From @pali

On Thursday 02 March 2017 04​:23​:35 Zefram via RT wrote​:

pali@​cpan.org wrote​:

We can use Encode​::encode() function in non-croak mode which replace
invalid characters by some replacement

No, that fucks up the filenames. After such a substituting decode,
re-encoding the result will produce some octet string different from
the original. So if you read a filename from a directory, attempting
to use that filename to address the file will at best fail because
it's a non-existent name. (If you're unlucky then it'll address a
*different* file.)

So if your locale encoding is set to ASCII then more applications
are unable to print on your terminal non-ascii characters.

I don't follow your argument here. You don't seem to be addressing
the crapness of making it impossible to deal with arbitrary
filenames at the syscall interface.

Understood. As I wrote in my first email, we probably cannot have both
the ability to access arbitrary files and uniform access to files
represented by Perl Unicode strings.

Latin-1 is not sane as it is unable to handle Unicode strings with
characters above U+0000FF. It wrong as ASCII or UTF-8.

My objective isn't to make every Unicode string represent a filename.

Basically, what ordinary applications show to users are Unicode file
names, not bytes.

Likewise, a user entering a file name into a file-open dialog or on
console stdin produces a sequence of key presses which represent
characters (which map fully to Unicode), not a sequence of bytes.

Also I want to create a file named "ÿ" with Perl in the same way on
Windows and Linux.

So to have a fixed open() we need to be able to represent every Perl
Unicode string as a file name. (With the possibility of failure if the
underlying system is not able to store the given file name.)

My objective is to have every filename represented by some Perl
string.

I understand... and in the current model with Perl strings it is impossible.

Latin-1 would be a poor choice in situations where it is
desired to represent arbitrary Unicode strings,

Right!

but it's an
excellent choice for the job of representing filenames. Different
jobs have different requirements, leading to different design
choices.

So user or application (or library or system) must know in which
encoding are stored file names. And this should be present in
current locale.

Impossible. The locale model of character encoding (as you treat it
here) is fundamentally broken.

Yes, it is broken. But the problem is that it is used by system
applications... :-(

The locale encoding is OK if one treats it strictly as a user
*preference*. What one can do with such a preference without risking
running into uncooperative data is quite limited.

So if user has stored files in different encoding as specified in
locale, then user has already problem to handle such files

I run in the C locale, which on this system has nominally ASCII
encoding (which is in fact my preferred encoding), and yet I
occasionally run into filenames that are derived from UTF-8 or
Latin-1 encoding. Do you realise how much difficulty I have in
dealing with such files? None at all. For my shell is 8-bit clean,
and every program I use just passes the octet string straight
through (e.g., from argv to syscalls). This is a healthy system.

Probably some programs like "ls" are not able to print UTF-8 encoded
file names on your terminal...

The only programs I've encountered that have any difficulty with
non-ASCII filenames are two language implementations (Rakudo Perl 6
and GNU Guile 2.0) that I don't use for real work. Both of them have
decided, independently, that filenames must be encodings of arbitrary
Unicode strings. Interestingly, they've reached different
conclusions about what encoding is used​: Guile considers it to be
the locale's nominal encoding, whereas Rakudo reckons it's UTF-8
regardless of locale. (Rakudo is making an attempt to augment its
concept of Unicode strings to be able to represent arbitrary Unicode
strings in a way compatible with UTF-8, but that's not fully working
yet, and I'm not convinced that it can ever work satisfactorily.)
Don't make the same mistake as these projects.

We can define new use feature 'unicode_filenames' or something like
that and then Perl's open() function can be "fixed".

That would be a lexically-scoped effect, which (as mentioned earlier)
loses as soon as a filename crosses a module boundary.

We need to store the "unicode filename" information in the Perl scalar
itself, and make sure it won't be lost on assignment or in other string
functions...

Another idea:

Can't we create a new kind of magic, like the one for vstrings, which
would contain additional information for a file name? Functions like
readdir could properly create such magic scalars, and open would handle
them correctly when they are passed to it. Like a vstring, it could
contain some string representation in the PV slot, so it would be
possible to pass such a scalar to print/warn or to any XS function which
is not aware of the new magic. The magic could store platform/system
dependent settings, like which encoding is used.

This could fix the problem of accessing an arbitrary file: you just
compose a magic scalar (maybe via some function or pragma) in the
system-dependent representation and then pass it to open(). It would
also fix the problem of passing any Unicode file name: you compose a
normal Perl Unicode string, and based on some settings open() would
convert it to the system-dependent representation. open() would first
try to use the magic properties, and if they are not present it would
fall back to Encode on the content of the string. Maybe use of Encode
would need to be enabled globally (or locally).

Is it usable? Or are there also problems?


@p5pRT p5pRT commented Mar 4, 2017

From zefram@fysh.org

pali@​cpan.org wrote​:

On Thursday 02 March 2017 04​:23​:35 Zefram via RT wrote​:

My objective is to have every filename represented by some Perl
string.

I understand... and in current model with perl strings it is impossible.

No, it *is* possible, and easy. What's not possible is to do that and
simultaneously achieve your other goal of having almost all Unicode
strings represent some filename in a manner that's conventional for
the platform. One of these goals is more important than the other.

Probably some programs like "ls" are not able to print UTF-8 encoded file
names into your terminal...

It can't print them *literally*, and it handles that issue quite well.
GNU ls(1) pays attention to the locale encoding in a sensible manner,
mainly looking at the character repertoire. In the ASCII locale, by
default it displays a question mark in place of high-half octets, which
clues me in that there's a problematic octet. With the -b option it
represents them as backslash escapes, which if need be I can copy into
a shell $'' construct. Actually tab completion is almost always the
solution to entering the filename at the shell, and the completion that
it generates uses $''. This is a healthy system​: I have no difficulty
in examining and using awkward filenames through my preferred medium
of ASCII.

Can't we create a new kind of magic, like the one for vstrings, which would
contain additional information for a file name?

No. This would be octet-vs-character distinction all over again;
see several previous discussions on p5p. vstrings kinda work, though
not fully, because we hardly ever perform string operations on version
numbers with an expectation of producing a version number as output.
But we manipulate filenames by string means all the time.

-zefram

@p5pRT p5pRT commented Mar 4, 2017

From @xenu

On Sat, 4 Mar 2017 05​:21​:37 +0000
Zefram <zefram@​fysh.org> wrote​:

pali@​cpan.org wrote​:

On Thursday 02 March 2017 04​:23​:35 Zefram via RT wrote​:

My objective is to have every filename represented by some Perl
string.

I understand... and in current model with perl strings it is impossible.

No, it *is* possible, and easy.

Is it? Remember that we're also talking about Windows.

@p5pRT p5pRT commented Mar 4, 2017

From zefram@fysh.org

Tomasz Konojacki wrote​:

Is it? Remember that we're also talking about Windows.

See upthread. The easy way to do it is different on Windows from how
it is on Unix, but in both cases there's an obvious and simple way to
represent all native filenames as Perl strings. The parts that would
be platform-dependent are reasonably well localised within the core;
programs written in Perl wouldn't need to be aware of the difference.

An issue that we haven't yet considered is passing filenames as
command-line arguments. Before Unicode, we could expect something like
open(H, "<", $ARGV[0]) to work. (Well, pre-SvUTF8 Perl didn't have
three-arg open, but apart from the syntax that would work.) Currently
$ENV{PERL_UNICODE} means that a program can't fully predict how argv[]
will be mapped into @​ARGV, but as it happens the Unicode bug in open()
papers over that, so feeding an @​ARGV element directly into open() like
this will still work. (You lose if you perform any string operation on
the way, though.)

In any system with a fixed open(), this probably ought to continue to
work​: a filename supplied as a command-line argument, in the platform's
conventional manner, should yield an @​ARGV element which, if fed to
open() et al, functions as that filename. Unlike the question of
encoding character strings as filenames, Unix does have well-defined
conventions for this, with argv elements and filenames in the syscall
API both being nul-terminated octet strings, and an identity mapping
expected between them.

What about on Windows? What form does argv[] take, in its most native
version? How does one conventionally encode a Unicode filename as a
command-line argument?

-zefram

@p5pRT p5pRT commented Mar 5, 2017

From @pali

On Saturday 04 March 2017 06​:22​:18 you wrote​:

pali@​cpan.org wrote​:

On Thursday 02 March 2017 04​:23​:35 Zefram via RT wrote​:

My objective is to have every filename represented by some Perl
string.

I understand... and in current model with perl strings it is
impossible.

No, it *is* possible, and easy. What's not possible is to do that
and simultaneously achieve your other goal of having almost all
Unicode strings represent some filename in a manner that's
conventional for the platform. One of these goals is more important
than the other.

So it is not possible (at least not easily). See the first post I wrote
on this bug. For you it is just not important, but it is important for me
and for other people too. And what I described in the first post is a bug
which I would like to see fixed.

As I wrote, I want to create a file named "ÿ", where that name is stored in
a Perl string. And I should be able to do it via a uniform Perl function,
without hacks like checking $^O.

Can't we create a new kind of magic, like the one for vstrings, which would
contain additional information for a file name?

No.

Why?

This would be octet-vs-character distinction all over again;

But this is your argument. On Linux, octets must be used as the file
name in order to support an arbitrary file stored on disk.

see several previous discussions on p5p.

Any pointers?

vstrings kinda work, though
not fully, because we hardly ever perform string operations on
version numbers with an expectation of producing a version number as
output. But we manipulate filenames by string means all the time.

Yes, but what is the problem? It would be a magic scalar, and all get/set
operations on it could be implemented in a platform-dependent manner.

Also, functions like readdir can correctly prepare such a scalar, so whether
you modify it or pass it directly to open, you will open any file correctly.

So what is the problem with this idea?

@p5pRT p5pRT commented Mar 5, 2017

From @pali

On Saturday 04 March 2017 15​:28​:02 you wrote​:

Tomasz Konojacki wrote​:

Is it? Remember that we're also talking about Windows.

See upthread. The easy way to do it is different on Windows from how
it is on Unix, but in both cases there's an obvious and simple way to
represent all native filenames as Perl strings.

You suggest that on Linux we should use only binary octets for file
names. That will not work on Windows, where you need to pass a
Unicode string as the file name.

So if a user wants to create a file named "ÿ", then it would be necessary
to do something like this:

use utf8;
my $filename = "ÿ";
# Encode to UTF-8 octets everywhere except Windows.
utf8::encode($filename) if $^O ne "MSWin32";
open my $file, ">", $filename or die;

(or replace utf8::encode with another function which converts a Perl
Unicode string to byte octets).

So this approach of yours is not useful, as a script for creating a file
named "ÿ" would need to deal with every platform and its specific behaviour.

To solve this problem, you need to be able to pass a Unicode string as
the file name to open.

What about on Windows? What form does argv[] take, in its most
native version? How does one conventionally encode a Unicode
filename as a command-line argument?

Like other WinAPI functions, argv also comes in -A and -W variants.
-A is encoded in the current locale and -W in modified UTF-16. So
if you want, you can get a Unicode string.

@p5pRT p5pRT commented Mar 5, 2017

From zefram@fysh.org

pali@​cpan.org wrote​:

So if user want to create file named "ÿ",

You can't do this, because, at the level you're specifying it, this isn't
a well-defined action on Unix. Some encoding needs to be used to turn
the character into an octet string, and there isn't anything intrinsic
to the platform that determines which encoding to use.

The code that you then give is a bit more specific. I think the effect
you're trying to specify in the code is that you use the octet string
"\xc3\xbf" on Unix and the character string "\x{ff}" on Windows. If this
lower-level description is actually what you want to achieve, then you
should expect to need platform-dependent code to do it, because this is
by definition a platform-dependent effect.

You *could* make the top-level program cleaner by hiding the platform
dependence, and on Unix the choice of encoding, in a module. Your program
could then look like

  open my $file, ">", pali_filename_encode("\xff") or die;

The filename encoder translates an arbitrary Unicode string into
a filename in a manner that is conventional for the platform, and
represents the filename as a Perl string in the manner required
for open(). It could well become part of File​::Spec. Note that the
corresponding decoder must fail on some inputs.
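
A minimal sketch of what such an encoder might look like. The name
filename_encode, its optional encoding argument, and the choice of UTF-8 as
the default on Unix are assumptions made for illustration; nothing like this
exists in File::Spec today. The Windows branch only makes sense together with
a fixed open() that accepts character strings there.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not an existing API.
# Assumption: non-Windows platforms want octets in the given encoding
# (UTF-8 unless told otherwise); Windows gets the character string as-is.
sub filename_encode {
    my ($name, $encoding) = @_;
    $encoding //= 'UTF-8';
    return $name if $^O eq 'MSWin32';
    # FB_CROAK makes encode() die on characters the encoding cannot
    # represent, instead of silently substituting something else.
    return Encode::encode($encoding, $name, Encode::FB_CROAK);
}

my $native = filename_encode("\xff");
# Show the resulting octets in hex rather than creating a real file.
print join(' ', map { sprintf '%02X', ord } split //, $native), "\n";
# open my $file, '>', $native or die "open: $!";
```

Because the helper works on characters, "\xff" and "\N{U+FF}" give the same
result regardless of how Perl happens to store the string internally.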

-zefram

@p5pRT p5pRT commented Mar 5, 2017

From @pali

On Sunday 05 March 2017 11​:44​:40 you wrote​:

pali@​cpan.org wrote​:

So if user want to create file named "ÿ",

You can't do this, because, at the level you're specifying it, this
isn't a well-defined action on Unix. Some encoding needs to be used
to turn the character into an octet string, and there isn't anything
intrinsic to the platform that determines which encoding to use.

The code that you then give is a bit more specific. I think the
effect you're trying to specify in the code is that you use the
octet string "\xc3\xbf" on Unix and the character string "\x{ff}" on
Windows. If this lower-level description is actually what you want
to achieve, then you should expect to need platform-dependent code
to do it, because this is by definition a platform-dependent effect.

You *could* make the top-level program cleaner by hiding the platform
dependence, and on Unix the choice of encoding, in a module. Your
program could then look like

  open my $file, ">", pali_filename_encode("\xff") or die;

The filename encoder translates an arbitrary Unicode string into
a filename in a manner that is conventional for the platform, and
represents the filename as a Perl string in the manner required
for open(). It could well become part of File​::Spec. Note that the
corresponding decoder must fail on some inputs.

-zefram

Exactly! This is what high-level programs want to do and achieve. They
really should not care about low-level OS differences.

The decoder does not always have to fail on non-encodable input. It can,
e.g., use the Encode module directly and let the caller specify what to do with
bad input​: https://metacpan.org/pod/Encode#Handling-Malformed-Data
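
For illustration, a small sketch of how Encode's CHECK argument changes what
happens to input that does not decode. The sample octets are made up;
FB_CROAK and FB_DEFAULT are the documented Encode constants.

```perl
use strict;
use warnings;
use Encode ();

# Latin-1 octets for "café!"; the lone \xe9 is not well-formed UTF-8.
my $octets = "caf\xe9!";

# FB_CROAK: decode() dies on malformed input. A copy is passed because
# decode() may overwrite its argument when CHECK is set.
my $copy   = $octets;
my $strict = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
print "strict decode rejected the name\n" unless defined $strict;

# FB_DEFAULT: the malformed octet is replaced with U+FFFD.
my $lossy = Encode::decode('UTF-8', $octets, Encode::FB_DEFAULT);
printf "lossy decode put U+%04X at index 3\n", ord substr($lossy, 3, 1);
```

The lossy variant never fails, but it cannot be mapped back to the original
octets, so it is only suitable for display.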

But before we can start implementing such a thing (e.g. in the File::Spec
module), we need to have a defined API for open() and to resolve the bug
("\xFF" eq "\N{U+FF}") which I described in the first post, because right
now it is not specified whether open() takes a Unicode string or byte octets...

@p5pRT p5pRT commented Aug 20, 2018

From @pali

On Tuesday 28 February 2017 00​:35​:45 Leon Timmermans wrote​:

On Mon, Feb 27, 2017 at 10​:21 PM, <pali@​cpan.org> wrote​:

Windows has two sets of functions for accessing files. The first, with the -A
suffix, takes file names in the encoding of the current 8-bit codepage. The
second, with the -W suffix, takes file names in Unicode (more precisely,
in the Windows variant of UTF-16). With the -A functions it is possible to
access only those files whose names contain only characters available in
the current 8-bit codepage. Internally, all file names are stored in
Unicode, so the -W functions must be used to have access to any file name.
Therefore, on Windows, Perl's open() function needs a Unicode file name to
have access to any file stored on disk.

Linux stores file names as binary octets; there is no encoding or
requirement for Unicode. Therefore, to access any file on Linux, Perl's
open() function should take a downgraded/non-Unicode file name.

Which means there is no way to have uniform, multiplatform support for
file access without hacks.

Correct observations. Except OS X makes this more complicated still​:
it uses UTF-8 encoded bytes, normalized using a non-standard variation
of NFD.

For completeness​:

Windows uses UCS-2 for file names and also in the corresponding WinAPI -W
functions which operate on file names. It is not UTF-16, as file names
may really contain unpaired surrogates.

OS X uses a non-standard variant of Unicode NFD encoded in UTF-8.

Linux uses just binary octets.

Idea how to handle file names in Perl​:

Store file names in extended Perl's Unicode (with code points above
U+1FFFFF). Non-extended code points would represent normal Unicode code
points. And code points above U+1FFFFF would represent parts of file
name which cannot be unambiguously represented in Unicode.

On Linux, take the file name (which is char*) and start decoding it from
UTF-8. A sequence of bytes which cannot be decoded as UTF-8 would be
decoded as a sequence of extended code points (e.g. U+200000 - U+2000FF).
This operation has an inverse and can therefore be used to convert any
file name stored on a Linux system. Plus it is UTF-8 friendly: if
filenames in the VFS are stored in UTF-8 (which is now common), then perl's
say function can correctly print them.
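
The Linux mapping described above can be sketched roughly as follows. The
function names fs_decode and fs_encode and the constant ESCAPE_BASE are
hypothetical, and using Encode with FB_QUIET to find the well-formed prefix is
just one possible way to implement it, not something proposed in this thread.

```perl
use strict;
use warnings;
no warnings 'non_unicode';   # code points above U+10FFFF are deliberate here
use Encode ();

use constant ESCAPE_BASE => 0x200000;   # assumed start of the escape range

# Octets -> characters. Well-formed UTF-8 is decoded normally; every octet
# that is not part of well-formed UTF-8 becomes ESCAPE_BASE + octet.
sub fs_decode {
    my ($octets) = @_;
    my $chars = '';
    while (length $octets) {
        # FB_QUIET returns the well-formed prefix and leaves the
        # unprocessed remainder in $octets.
        $chars .= Encode::decode('UTF-8', $octets, Encode::FB_QUIET);
        last unless length $octets;
        $chars .= chr(ESCAPE_BASE + ord(substr($octets, 0, 1, '')));
    }
    return $chars;
}

# Characters -> octets, undoing fs_decode.
sub fs_encode {
    my ($chars) = @_;
    my $octets = '';
    for my $ch (split //, $chars) {
        my $cp = ord $ch;
        if ($cp >= ESCAPE_BASE && $cp <= ESCAPE_BASE + 0xFF) {
            $octets .= chr($cp - ESCAPE_BASE);       # escaped raw octet
        }
        else {
            $octets .= Encode::encode('UTF-8', $ch); # ordinary character
        }
    }
    return $octets;
}

my $raw  = "abc\xc3\xbf\xff\xfe";   # "abcÿ" in UTF-8 plus two stray octets
my $name = fs_decode($raw);
print fs_encode($name) eq $raw ? "round trip ok\n" : "round trip failed\n";
```

Note that the round trip is only guaranteed in the octets-to-characters-to-octets
direction, which is the one needed to reach every existing file.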

On OS X, take the file name (which is char*, but in UTF-8) and just decode it
from UTF-8. For conversion from Perl's Unicode to char*, just apply that
non-standard NFD normalization and encode to UTF-8.

On Windows, take the file name (wchar_t*, which is uint16_t*) compatible with
the -W WinAPI functions, which represents a UCS-2 sequence, and decode it to
Unicode. There can be unpaired surrogates; represent them either as
Unicode surrogate code points or as Perl's extended code points (above
U+1FFFFF). The reverse process (from Perl's Unicode to wchar_t*/uint16_t*)
is obvious.

@p5pRT p5pRT commented Aug 21, 2018

From @dur-randir

On Mon, 20 Aug 2018 01​:48​:07 -0700, pali@​cpan.org wrote​:

Store file names in extended Perl's Unicode (with code points above
U+1FFFFF). Non-extended code points would represent normal Unicode code
points. And code points above U+1FFFFF would represent parts of file
name which cannot be unambiguously represented in Unicode.

And then someone passes this string to an API call that expects well-formed UTF-8, and everything crashes. Perl core has recently taken a lot of steps to allow only well-formed UTF-8 strings to be available to the user, and now you suggest taking a step back - I don't think that's a good idea.

It could work if you could separate such strings into their own namespace - but that'd require an API change for all filesystem-related functions.

@p5pRT p5pRT commented Aug 21, 2018

From @pali

On Tuesday 21 August 2018 02​:02​:18 Sergey Aleynikov via RT wrote​:

On Mon, 20 Aug 2018 01​:48​:07 -0700, pali@​cpan.org wrote​:

Store file names in extended Perl's Unicode (with code points above
U+1FFFFF). Non-extended code points would represent normal Unicode code
points. And code points above U+1FFFFF would represent parts of file
name which cannot be unambiguously represented in Unicode.

And then someone passes this string to an API call that expects well-formed UTF-8, and everything crashes. Perl core has recently taken a lot of steps to allow only well-formed UTF-8 strings to be available to the user, and now you suggest taking a step back - I don't think that's a good idea.

It could work if you could separate such strings into their own namespace - but that'd require an API change for all filesystem-related functions.

Yesterday on IRC I presented following idea, which could solve above
problem.

Introduce a new qf operator which takes Unicode string and returns perl
object which would represent file name. Internally object itself can
store file name as it needs (e.g. sequence of integer code points, if
storing code points above U+1FFFFF in UTF-8 string is bad) and every
perl's filesystem function (like open()) would interpret these file name
objects specially -- without The Unicode bug, etc...

Also functions like readdir() would return these file name objects
instead of regular strings.

Those file name objects could have a proper stringification operator that
always produces a printable string for the file name. And for those
non-representable code points above U+1FFFFF, the stringification function
can escape them via some ASCII sequences.

This would allow​:
In module ABC to create a file name via qf operator and pass it into
module CDE which calls open() on argument passed from module ABC.

All those fs functions (like open()) would work as before, so there
would not be any regression for existing code. Only when the passed argument
is that special object, it would be handled differently.
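
A rough sketch of the object side of this idea, using only what Perl already
offers. The class name My::FileName and its methods are invented for
illustration; there is no qf operator, and today's open() would simply
stringify such an object rather than treat it specially.

```perl
use strict;
use warnings;

package My::FileName {   # hypothetical class standing in for qf's result
    # Stringification always yields something printable.
    use overload '""' => \&printable, fallback => 1;

    # Store the name as a list of integer code points, as suggested above.
    sub new {
        my ($class, @code_points) = @_;
        return bless { cps => [@code_points] }, $class;
    }

    # Code points above U+1FFFFF are shown as ASCII escapes instead of
    # being emitted as characters.
    sub printable {
        my ($self) = @_;
        return join '', map {
            $_ > 0x1FFFFF ? sprintf('\\x{%X}', $_) : chr($_)
        } @{ $self->{cps} };
    }
}

my $name = My::FileName->new(0x61, 0x62, 0x2000FE);
print "file name: $name\n";
```

A module receiving $name can pass it along untouched, which is what keeps the
information intact across module boundaries.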

@p5pRT p5pRT commented Aug 22, 2018

From @dur-randir

On Tue, 21 Aug 2018 02​:11​:41 -0700, pali@​cpan.org wrote​:

Introduce a new qf operator which takes Unicode string and returns perl
object which would represent file name. Internally object itself can
store file name as it needs (e.g. sequence of integer code points, if
storing code points above U+1FFFFF in UTF-8 string is bad) and every
perl's filesystem function (like open()) would interpret these file name
objects specially -- without The Unicode bug, etc...

Also functions like readdir() would return these file name objects
instead of regular strings.

Yeah, that's a path of changing API.
