[meta] Win32 Unicode support #17094
From @tonycoz
Created by @tonycoz
Perl typically uses so-called "ANSI" APIs, due to compatibility
chcp 65001 isn't a solution, since pretty much everything but
So this is a meta ticket covering tickets improving Perl's Win32
Any changes need to be switchable (so a user with old ANSI
I can see at least the following issues:
- command line arguments
- filenames across many different operators
- process creation (system, exec, readpipe/qx(), pipe open())
- environment variables
- console (maybe)
- many similar changes in bundled modules
Discussion is welcome, patches that satisfy the above
Perl Info
From bokutin@bokut.in
I've been waiting. Python and Ruby are already possible. I think it will not be helpful, I have built in the past by myself.
The RT System itself - Status changed from 'new' to 'open'
From @pali
The same problem also exists on Linux systems. I already created ticket (PS: Please CC me for future discussion as I do not know how can I add
Platforms
This ticket is specifically about Win32, but ideally we should do the same for POSIX-like systems.
Aim
Provide support for accessing filenames, program arguments and the environment as strings of characters rather than as strings of bytes, and support Win32's unicode interfaces.
This would need to handle possibly mis-encoded filenames, environment entries and argv entries on both Win32 and non-Win32 systems.
This would be opt-in
The new behaviour would only occur under a new
User visible behaviour
With the new flag, as follows.
Win32
POSIX-like systems
The UTF-8 flag
No promises are made on the value of the flag; it may be set for strings representable as ISO 8859-1, or it may not.
Handling of mis-encoded names
Outside of Win32, the strings returned by POSIX readdir() and found in argv[] and the environment can contain non-UTF-8 byte sequences.
On Win32, filenames may contain lone surrogates, which cannot properly be converted to UTF-8.
When translating from bytes to characters, non-UTF-8 sequences such as overlongs or extended UTF-8 sequences (surrogates, code points above 0x10FFFF) will be treated as invalidly encoded bytes.
Invalidly encoded bytes will be encoded as code point
On Win32, lone surrogates will be encoded as code point
This will be reversed when calling the underlying file API.
Extended characters outside these ranges will result in a file not found error or perhaps EINVAL.
XS/API
New APIs will be provided:
either follow the current behaviour, or perform the filename conversions discussed above if perl was invoked with -Cs.
Rationale
Why not make the behaviour lexical
Filenames are passed around between perl lexical scopes; consider a filename received from File::Find, or passed to IO::File.
I use supers/overlongs in my filenames, this will break that
Don't use this option, it's intended to support displayable filenames transparently.
Why do all the fancy decoding?
POSIX filenames are byte strings, they may not be valid UTF-8, and Win32 has a similar issue with lone surrogates. Simply storing the byte sequences with the UTF-8 flag on would produce invalid internal state for perl. This is less of an issue for Win32's lone surrogates, but treating them as valid seems incorrect to me.
References
[PEP383] https://www.python.org/dev/peps/pep-0383/ - "PEP 383 -- Non-decodable Bytes in System Character Interfaces." This makes the faulty assumption that Win32 filenames are Unicode.
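For readers who want to see the kind of reversible decoding discussed under "Handling of mis-encoded names", here is a minimal sketch using Python's "surrogateescape" error handler, which is the scheme PEP 383 describes. It is an analogy only: the code points the Perl proposal would use are not stated in the text above, so this shows Python's mapping (undecodable byte 0xNN becomes U+DCNN), not Perl's. The helper names decode_name, encode_name and is_strict_utf8 are made up for illustration.

```python
# Sketch only: Python's PEP 383 "surrogateescape" handler, used here as an
# analogy. This is NOT the mapping proposed for Perl in this ticket.

def decode_name(raw: bytes) -> str:
    """Bytes from the OS -> characters; undecodable bytes are escaped."""
    return raw.decode("utf-8", "surrogateescape")

def encode_name(name: str) -> bytes:
    """Characters -> bytes for the OS; escaped bytes are restored."""
    return name.encode("utf-8", "surrogateescape")

def is_strict_utf8(raw: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the byte string."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

samples = {
    "valid UTF-8": "caf\u00e9.txt".encode("utf-8"),
    "Latin-1 byte": b"caf\xe9.txt",
    "overlong '/'": b"\xc0\xaf",
    "encoded surrogate": b"\xed\xa0\x80",
    "above 0x10FFFF": b"\xf4\x90\x80\x80",
}

for label, raw in samples.items():
    text = decode_name(raw)
    # The conversion is reversed exactly when going back to bytes.
    assert encode_name(text) == raw
    print(label, is_strict_utf8(raw), ascii(text))
```

Only the first sample passes the strict decoder; the others are treated as invalidly encoded bytes, yet every sample round-trips back to the original byte string.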
IMO there should be a mechanism to enable this behavior other than the command-line flag. The current
I think it should be a completely separate CLI flag, that is, not a subflag of
CLI flags can be set via
Anyway, I generally agree with this proposal, however I don't understand why you don't want to use "Low Surrogates" and "High Surrogates" blocks for unpaired surrogates, as specified by WTF-8. What do we gain by keeping them illegal?
On second thought, a global variable (in addition to the flag) probably would make sense but I'm worried that people will want to abuse it with
Unpaired "Low Surrogates" and "High Surrogates" are illegal in UNICODE. So I think we should avoid using them, and also add the possibility to distinguish between real unpaired surrogates which come from other places (and are illegal) and those from Win32 filenames (which are legal).
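To make the disagreement about unpaired surrogates concrete, here is a small sketch in Python (used only as a convenient illustration, since nothing here is Perl API). A strict decoder rejects a lone surrogate, while Python's "surrogatepass" handler lets it through and produces the same three-byte sequence that WTF-8 would use for an unpaired surrogate. The sample name and the helper names are assumptions made up for this example.

```python
# Sketch only: Python's "surrogatepass" handler, used as an illustration of
# the WTF-8 idea. Nothing here is part of Perl or of this proposal.

# "a", an unpaired high surrogate U+D800, "b", as UTF-16-LE code units.
# Win32 allows such a sequence in a filename.
UTF16_NAME = b"a\x00\x00\xd8b\x00"

def can_decode_strict(raw: bytes) -> bool:
    """True if a strict UTF-16-LE decoder accepts the code units."""
    try:
        raw.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True

def utf16_to_generalized_utf8(raw: bytes) -> bytes:
    """UTF-16-LE -> UTF-8-like bytes, letting lone surrogates through."""
    text = raw.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")

def generalized_utf8_to_utf16(raw: bytes) -> bytes:
    """Reverse of utf16_to_generalized_utf8."""
    text = raw.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")

encoded = utf16_to_generalized_utf8(UTF16_NAME)
print(can_decode_strict(UTF16_NAME))   # strict decoding fails
print(encoded)                         # lone surrogate kept as 3 bytes
print(generalized_utf8_to_utf16(encoded) == UTF16_NAME)
```

Note that the resulting bytes are not valid UTF-8, which is the concern raised above: a consumer cannot tell from the bytes alone whether the surrogate came from a Win32 filename or from corrupt data elsewhere.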
I would suggest to have
Same here, I would suggest to have it in UNICODE (not UTF-8, not UTF-16).
And same here.
Earlier I suggested a solution for this issue: Add a new SV flag (or any other way to mark a particular SV*) that indicates that its value is the UNICODE version of a filename. And a lexical pragma could change the behavior of Perl so that for filenames it sets this new flag, to ensure that e.g. the result from readdir() will be UNICODE also when stored into a variable (SV*) and used outside of the lexical block.
argv[] is a C variable, it has no unicode flag
Maybe I wasn't clear enough here. The intent in each case is that @ARGV (with -CA), %ENV and the result of readdir() would be populated in utf8 from the wide version of the system APIs per the description below, and if necessary upgraded. Saying "in UNICODE" here without saying what you mean in terms of an effect on the implementation isn't meaningful.
You don't say what happens when such an SV is combined with an SV obtained while the unicode interface isn't active. Or how it's combined with SVs not from readdir(). I would see this requiring two flags to prevent problems, one to indicate it came from readdir() and another to indicate the mode it was read in. Also such a flag would be lost when the value is serialized (JSON, as a database column value, etc); making this effectively global (it's up to the developer to make it global across interacting processes) avoids that.
Oh, sorry about that. C variables are of course in UTF-8.
By UNICODE I mean that Perl scalars would contain a sequence of UNICODE code points. By UTF-8 I mean that a scalar would contain a sequence of UTF-8 bytes. For example the letter á in UNICODE is "\N{U+E1}" and in UTF-8 is "\x{c3}\x{a1}". I hope it is clear now.
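The distinction drawn here between a string of code points and a string of UTF-8 bytes can be checked with a few lines of Python, chosen only because it keeps the two as separate types; the variable names are illustrative.

```python
# Sketch only: the same letter as a code point string and as UTF-8 bytes.

as_code_points = "\u00e1"                      # one code point, U+00E1
as_utf8_bytes = as_code_points.encode("utf-8")  # two bytes, 0xC3 0xA1

print(len(as_code_points), [hex(ord(c)) for c in as_code_points])
print(len(as_utf8_bytes), [hex(b) for b in as_utf8_bytes])

# Treating the UTF-8 bytes as if they were code points (one character per
# byte) yields two characters, U+00C3 and U+00A1, instead of one.
misread = as_utf8_bytes.decode("latin-1")
print(len(misread), ascii(misread))
```

Mixing the two representations without tracking which one a value holds is the source of the double-encoding problems that the flag discussion above is trying to avoid.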
This is something which needs to be discussed and designed. I just want to show that it is possible to design and implement it. Of course it is not easy and there are a lot of edge cases...
Without fixing and extending serializers, it would not work. But at least it would work for pure-perl code, and the "pragma" can be initially marked as experimental to provide at least something, with later versions fixing / extending it until we come up with a stabilized implementation. I'm just trying to show that it is possible to fix this issue.
I can see being able to use a global variable to control filename handling (the result of readdir(), and how open, rename, unlink etc. handle names provided), but any problems encountered when a filename crosses this boundary would be the user's problem. The command-line option is needed to correctly set up @ARGV and %ENV, and to control what happens when %ENV is modified. I don't think the %ENV handling should be controllable at runtime. A command-line option is actually on the late side, we need argv[] properly set up on Win32 for
Migrated from rt.perl.org#134286 (status was 'open')
Searchable as RT134286$