New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Make \xNN mean utf8 code unit, not unicode codepoint. #2800
Comments
I think Unicode code points are much more intuitive to work with, not having to deal with implementation details of some specific encoding. |
Our strings definitely are utf8, it's not just "some specific encoding". We're very much exposing that and expecting programmers to know what that means. As they know what 2s complement integers (not auto-expanding-to-bignums) and 754 floating point (not rationals) are and what they do. If you want an array of unicode codepoints, that's That said, I'm somewhat sympathetic to the arguments about which way to do this. Followup is on the list, over in the thread that created this bug: https://mail.mozilla.org/pipermail/rust-dev/2012-July/002024.html Yes, this is part of the "utf8 monoculture" some people despise, but I am somewhat unrepentant about it. I think it's as stable, flexible and long-term an encoding as we're likely to see for years; the only plausible competitor on the horizon is GB18030 and it even covers different codepoints, so it's not really fair to consider it a "different encoding", it's a whole different charset. And, in any case, my experience is that the harm done to language users, especially systems-language users, by being ambiguous about the in-memory meaning of literals in program text far outweighs the harm done by picking some particular unambiguous interpretation. IOW on this topic I think the risk of underspecification is higher than the risk of overspecification. It would be more useful to support multiple-explicit-encodings -- even permit tagging a whole file as written-to-a-different-default-encoding -- if that every becomes a real concern, than to throw our hands up about the encoding and say "strings are implementation-specific!" Incidentally, it should be trivial to write a syntax extension that maps from encoding-to-encoding at compile time, i.e. one that lets you write |
Thank you for the explanation, it makes a lot of sense. |
Definitely. Some machinery exists in core for handling basic tasks associated with strings in the various operating-system-required encodings; more will wind up in libstd, likely a binding to libicu. |
Nominated for backwards compatible |
accepted for backwards-compatible milestone |
I agree that |
cc me |
I disagree. I don’t see a reason to use code units in (Unicode) literal strings. Why would you want If the concern is that If \xNN is still changed to represent code units, literals that contain invalid UTF-8 like |
I agree with that, @SimonSapin. The reason I wanted it was for byte string literals, but at the time |
Yes, I also want byte literals (and found this when searching for that.) |
If |
I guess the strongest argument in favor of code units is Behdad's, that it would make our string escapes compatible with C and Python. |
Behdad's full argument: Here: "\xHH, \uHHHH, \UHHHHHHHH Unicode escapes", I strongly suggest that |
I don’t know about C, but Behdad’s argument does not apply to Python. Python (both in 2.x and 3.x) has two types of strings: byte strings, where \xHH is a byte and \uHHHH is not an escape sequence; and Unicode strings where \xHH is a code point and |
There doesn't seem to be a definitive argument for either side, and since changing these to be code units makes their validation slightly harder, I'm inclined to just leave as-is and call it done. |
@SimonSapin indeed, I think graydon said the same thing in his initial response to Behdad, along with providing a more complete table of what different languages do here (Though that table is missing C# it seems.) So is Rust going to be more like python and scheme, or more like perl, go, C, C++, ruby... ? |
(having said that, I'm fine with brson's suggestion to leave things as they are.) |
In today's meeting we decided to leave this as is. |
There's not a lot of consensus on this between languages, but the C and C++ paths (also perl, go, and at least python3 'bytes' literals, though not 'string') treat this escape as a code unit, not a codepoint.
The text was updated successfully, but these errors were encountered: