RFC: Make \xNN mean utf8 code unit, not unicode codepoint. #2800

Closed
graydon opened this issue Jul 4, 2012 · 19 comments
Labels
A-grammar Area: The grammar of Rust A-unicode Area: Unicode C-cleanup Category: PRs that clean code up or issues documenting cleanup. E-easy Call for participation: Easy difficulty. Experience needed to fix: Not much. Good first issue.
Milestone

Comments

@graydon
Contributor

graydon commented Jul 4, 2012

There's not a lot of consensus on this between languages, but the C and C++ paths (also perl, go, and at least python3 'bytes' literals, though not 'string') treat this escape as a code unit, not a codepoint.

@ghost

ghost commented Jul 5, 2012

I think Unicode code points are much more intuitive to work with, not having to deal with implementation details of some specific encoding.
If a string consists of UTF-8 code units, then a single character may consist of one to four code units.
So I can have a ten-character string with a length of 40.
Operations like getting a substring can leave you with broken characters, by extracting fewer than all of the code units of a character.
As for other languages, Python used to do different things depending on how it was compiled.
This is fixed as of Python 3.3, and it now supports the full Unicode range without having to deal with surrogate pairs, and string operations are much more intuitive for it.
Can’t think of many examples off-hand, but one other language that defines characters in terms of code points, rather than code units in some specific encoding, is Haskell, at least since Haskell 98.
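For illustration only, here is a minimal sketch of the code-unit versus code-point distinction described above. It is written in present-day Rust, whose syntax and standard library postdate this thread; `lengths` is a hypothetical helper name introduced for the example.

```rust
/// Returns (number of UTF-8 code units, number of code points) for `s`.
/// `lengths` is a hypothetical helper, not a standard library function.
fn lengths(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // Ten characters outside the Basic Multilingual Plane; each takes
    // four code units (bytes) in UTF-8.
    let s = "\u{1F600}".repeat(10);
    let (units, points) = lengths(&s);
    println!("{} code units, {} code points", units, points);

    // A range that ends in the middle of a character cannot be extracted:
    // `get` returns None instead of a broken character.
    println!("{:?}", s.get(0..2).is_some());
    // A range that ends on a character boundary works.
    println!("{:?}", s.get(0..4).is_some());
}
```

The ten-character string reports a length of 40 because `len` counts bytes, which is the situation the comment describes.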

@graydon
Contributor Author

graydon commented Jul 5, 2012

Our strings definitely are utf8, it's not just "some specific encoding". We're very much exposing that and expecting programmers to know what that means, just as they know what 2s complement integers (not auto-expanding-to-bignums) and 754 floating point (not rationals) are and what they do. If you want an array of unicode codepoints, that's [char], not str. Likewise if you want a utf16 array, that's a different thing too. Python is actually the wrong precedent here; we're a systems language and users frequently flip into "I know about the in-memory implementation" assumptions, and even rely on them.

That said, I'm somewhat sympathetic to the arguments about which way to do this. Followup is on the list, over in the thread that created this bug: https://mail.mozilla.org/pipermail/rust-dev/2012-July/002024.html

Yes, this is part of the "utf8 monoculture" some people despise, but I am somewhat unrepentant about it. I think it's as stable, flexible and long-term an encoding as we're likely to see for years; the only plausible competitor on the horizon is GB18030 and it even covers different codepoints, so it's not really fair to consider it a "different encoding", it's a whole different charset. And, in any case, my experience is that the harm done to language users, especially systems-language users, by being ambiguous about the in-memory meaning of literals in program text far outweighs the harm done by picking some particular unambiguous interpretation. IOW on this topic I think the risk of underspecification is higher than the risk of overspecification. It would be more useful to support multiple-explicit-encodings -- even permit tagging a whole file as written-to-a-different-default-encoding -- if that ever becomes a real concern, than to throw our hands up about the encoding and say "strings are implementation-specific!"

Incidentally, it should be trivial to write a syntax extension that maps from encoding-to-encoding at compile time, i.e. one that lets you write utf16! "hello \U0010f0B1" and have it expand at compile time to [0xfeff_u16, 0x0068_u16, 0x0065_u16, 0x006c_u16, 0x006c_u16, 0x006f_u16, 0x0020_u16, 0xdbfc_u16, 0xdcb1_u16], or similar. Just note that this has a different type from str.
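As a rough sketch of the same mapping, the following does the conversion at run time rather than in a compile-time syntax extension. It uses present-day Rust, which postdates this thread; `utf16_with_bom` is a hypothetical helper name, while `str::encode_utf16` is the standard library method doing the actual work.

```rust
/// Encode `s` as UTF-16 code units, prefixed with a byte-order mark.
/// `utf16_with_bom` is a hypothetical helper, not a standard API.
fn utf16_with_bom(s: &str) -> Vec<u16> {
    let mut units = vec![0xFEFF_u16];
    // encode_utf16 yields one unit per BMP character and a surrogate
    // pair for each character above U+FFFF.
    units.extend(s.encode_utf16());
    units
}

fn main() {
    // \u{10F0B1} is the present-day spelling of the \U0010f0B1 escape.
    let units = utf16_with_bom("hello \u{10F0B1}");
    for u in &units {
        print!("{:#06x} ", u);
    }
    println!();
}
```

The result is a `Vec<u16>`, which illustrates the closing point: whatever form the encoded data takes, it has a different type from `str`.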

@ghost

ghost commented Jul 6, 2012

Thank you for the explanation, it makes a lot of sense.
I assumed too much about the str type and its purpose, and appreciate the clarification.
Perhaps there is room in the standard library for a text module of some sort, for doing more high-level work with text?

@graydon
Contributor Author

graydon commented Jul 6, 2012

Definitely. Some machinery exists in core for handling basic tasks associated with strings in the various operating-system-required encodings; more will wind up in libstd, likely a binding to libicu.

@pcwalton
Contributor

pcwalton commented May 9, 2013

Nominated for backwards compatible

@graydon
Contributor Author

graydon commented Jun 6, 2013

accepted for backwards-compatible milestone

@emberian
Member

emberian commented Aug 5, 2013

I agree that \xNN should be a code unit and \u... should be code point.

@pnkfelix
Member

cc me

@SimonSapin
Contributor

I disagree. I don’t see a reason to use code units in (Unicode) literal strings. Why would you want "\xEF\xBF\xBD" rather than "\uFFFD"? (Of course, byte string literals are a different story.)

If the concern is that "\xNN" looks like a byte while it’s not, we could only allow it for values in the ASCII range (\x00 to \x7F) i.e. for code points that are represented as one byte in UTF-8.

If \xNN is still changed to represent code units, literals that contain invalid UTF-8 like "\x80" should be compile-time errors so as to not break str’s promise to contain valid UTF-8.
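To make the byte-string case concrete, here is a small sketch in present-day Rust, whose byte string literals and `\u{...}` escape syntax postdate this thread. `decode` is a hypothetical helper name. In present-day Rust a `str` literal accepts `\xNN` only in the ASCII range, so `"\x80"` is rejected at compile time, while a byte string literal may hold any byte.

```rust
use std::str;

/// Try to view raw bytes as UTF-8 text; None if the bytes are not valid UTF-8.
/// `decode` is a hypothetical helper, not a standard API.
fn decode(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    // In a byte string literal, \xNN is one raw byte (a UTF-8 code unit).
    let replacement = decode(b"\xEF\xBF\xBD");
    // The same character written as a code point in a str literal.
    println!("{}", replacement == Some("\u{FFFD}"));

    // A lone continuation byte is not valid UTF-8, so it cannot become a str.
    println!("{}", decode(b"\x80").is_none());

    // In a str literal, \xNN in the ASCII range is a single byte.
    println!("{}", "\x7F".len());
}
```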

@emberian
Member

emberian commented Jan 2, 2014

I agree with that, @SimonSapin. The reason I wanted it was for byte string literals, but at the time bytes!(...) didn't exist and IMO that allows for much nicer literals than \xNN style stuff. I no longer agree with this change.

@SimonSapin
Contributor

Yes, I also want byte literals (and found this when searching for that.)

@brson
Contributor

brson commented Jan 3, 2014

If \xFF indicates a code unit does that mean that character literals need to support multiple \x escapes? I can't tell yet if there's any precedent for that in other languages.

@brson
Contributor

brson commented Jan 3, 2014

I guess the strongest argument in favor of code units is Behdad's, that it would make our string escapes compatible with C and Python.

@brson
Contributor

brson commented Jan 3, 2014

Behdad's full argument:

Here: "\xHH, \uHHHH, \UHHHHHHHH Unicode escapes", I strongly suggest that
\xHH be modified to allow inputting direct UTF-8 bytes. For ASCII it doesn't
make any difference. For Latin1, it gives the impression that strings are
stored in Latin1, which is not the case. It would also make C / Python
escaped strings directly usable in Rust. Ie. '\xE2\x98\xBA' would be a single
character equivalent to '\u263a', not three Latin1 characters.

@SimonSapin
Contributor

I don’t know about C, but Behdad’s argument does not apply to Python. Python (both in 2.x and 3.x) has two types of strings: byte strings, where \xHH is a byte and \uHHHH is not an escape sequence; and Unicode strings where \xHH is a code point and u'\xE2\x98\xBA' is indeed three code points in the Latin1 range.
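The two readings of \xE2\x98\xBA under discussion can be sketched side by side. This is an illustration in present-day Rust, not the dialect of the time; `as_utf8` and `as_latin1` are hypothetical helper names introduced for the example.

```rust
/// Code-unit reading: treat each value as one byte of a UTF-8 sequence.
/// Returns None if the bytes are not valid UTF-8.
fn as_utf8(bytes: &[u8]) -> Option<String> {
    String::from_utf8(bytes.to_vec()).ok()
}

/// Code-point reading: treat each value as a code point in the
/// Latin-1 range (U+0000 to U+00FF).
fn as_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    let bytes = [0xE2, 0x98, 0xBA];

    // Code-unit reading: the three bytes form the single character U+263A.
    let one = as_utf8(&bytes);
    println!("{:?}", one.as_deref().map(|s| s.chars().count()));

    // Code-point reading: three separate characters U+00E2, U+0098, U+00BA.
    let three = as_latin1(&bytes);
    println!("{}", three.chars().count());
}
```

The first reading matches what Behdad's argument asks for; the second matches the behaviour of Python Unicode strings described above.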

@brson
Contributor

brson commented Jan 18, 2014

There doesn't seem to be a definitive argument for either side, and since changing these to be code units makes their validation slightly harder, I'm inclined to just leave as-is and call it done.

@pnkfelix
Member

@SimonSapin indeed, I think graydon said the same thing in his initial response to Behdad, along with providing a more complete table of what different languages do here

(Though that table is missing C# it seems.)

So is Rust going to be more like python and scheme, or more like perl, go, C, C++, ruby... ?

@pnkfelix
Member

(having said that, I'm fine with brson's suggestion to leave things as they are.)

@brson
Contributor

brson commented Jan 21, 2014

In today's meeting we decided to leave this as is.


6 participants