RFC: Make \xNN mean utf8 code unit, not unicode codepoint. #2800

graydon · 2012-07-04T18:48:16Z

There's not a lot of consensus on this between languages, but the C and C++ paths (also perl, go, and at least python3 'bytes' literals, though not 'string') treat this escape as a code unit, not a codepoint.

ghost · 2012-07-05T14:38:04Z

I think Unicode code points are much more intuitive to work with, not having to deal with implementation details of some specific encoding.
If a string consists of UTF-8 code units, then a single character may consist of one to four code units.
So I can have a ten-character string with a length of 40.
Operations like getting a substring can leave you with broken characters, by extracting fewer than all of the code units of a character.
As for other languages, Python used to do different things depending on how it was compiled.
This is fixed as of Python 3.3, and it now supports the full Unicode range without having to deal with surrogate pairs, and string operations are much more intuitive for it.
Can’t think of many examples off-hand, but one other language that defines characters in terms of code points, rather than code units in some specific encoding, is Haskell, at least since Haskell 98.
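
A minimal sketch of the code-unit versus code-point distinction described above. It uses present-day Rust (`\u{...}` escapes, `str::get`), not the 2012 dialect this thread is about, and the helper names `code_units`, `code_points`, and `safe_slice` are hypothetical, chosen only for illustration.

```rust
/// Number of UTF-8 code units (bytes) in `s`.
fn code_units(s: &str) -> usize {
    s.len()
}

/// Number of Unicode code points (scalar values) in `s`.
fn code_points(s: &str) -> usize {
    s.chars().count()
}

/// Byte-range substring that returns `None` instead of splitting a character.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    // Ten characters, each of which needs four UTF-8 code units.
    let s = "\u{1F600}".repeat(10);
    println!("code units:  {}", code_units(&s)); // 40
    println!("code points: {}", code_points(&s)); // 10

    // Bytes 0..2 cover only half of the first character, so no slice is returned.
    println!("0..2 -> {:?}", safe_slice(&s, 0, 2));
    // Bytes 0..4 cover exactly one character.
    println!("0..4 -> {:?}", safe_slice(&s, 0, 4));
}
```

The checked `get` is one way to avoid the "broken characters" problem: a range that does not fall on character boundaries yields `None` rather than a partial character.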

graydon · 2012-07-05T20:02:42Z

Our strings definitely are utf8; it's not just "some specific encoding". We're very much exposing that and expecting programmers to know what it means, just as they know what 2s complement integers (not auto-expanding-to-bignums) and 754 floating point (not rationals) are and what they do. If you want an array of unicode codepoints, that's [char], not str. Likewise if you want a utf16 array, that's a different thing too. Python is actually the wrong precedent here; we're a systems language and users frequently flip into "I know about the in-memory implementation" assumptions, and even rely on them.

That said, I'm somewhat sympathetic to the arguments about which way to do this. Followup is on the list, over in the thread that created this bug: https://mail.mozilla.org/pipermail/rust-dev/2012-July/002024.html

Yes, this is part of the "utf8 monoculture" some people despise, but I am somewhat unrepentant about it. I think it's as stable, flexible and long-term an encoding as we're likely to see for years; the only plausible competitor on the horizon is GB18030 and it even covers different codepoints, so it's not really fair to consider it a "different encoding", it's a whole different charset. And, in any case, my experience is that the harm done to language users, especially systems-language users, by being ambiguous about the in-memory meaning of literals in program text far outweighs the harm done by picking some particular unambiguous interpretation. IOW on this topic I think the risk of underspecification is higher than the risk of overspecification. It would be more useful to support multiple-explicit-encodings -- even permit tagging a whole file as written-to-a-different-default-encoding -- if that ever becomes a real concern, than to throw our hands up about the encoding and say "strings are implementation-specific!"

Incidentally, it should be trivial to write a syntax extension that maps from encoding-to-encoding at compile time, i.e. one that lets you write utf16! "hello \U0010f0B1" and have it expand at compile time to [0xfeff_u16, 0x0068_u16, 0x0065_u16, 0x006c_u16, 0x006c_u16, 0x006f_u16, 0x0020_u16, 0xdbfc_u16, 0xdcb1_u16], or similar. Just note that this has a different type from str.
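
As a rough illustration of that expansion, the sketch below does the same conversion at run time with the standard library's `str::encode_utf16`. It is not the compile-time syntax extension described above, and the function name `utf16_with_bom` is hypothetical. It is written in present-day Rust, where the escape is spelled `\u{10f0b1}`.

```rust
/// Encode `s` as UTF-16 code units, prefixed with a byte-order mark (U+FEFF).
fn utf16_with_bom(s: &str) -> Vec<u16> {
    std::iter::once(0xfeff_u16)
        .chain(s.encode_utf16())
        .collect()
}

fn main() {
    let units = utf16_with_bom("hello \u{10f0b1}");

    // U+10F0B1 lies outside the Basic Multilingual Plane, so it is encoded
    // as a surrogate pair: the last two code units of the result.
    for unit in &units {
        print!("0x{:04x} ", unit);
    }
    println!();

    // The result is a Vec<u16>, a different type from str.
    println!("{} UTF-16 code units", units.len());
}
```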

ghost · 2012-07-06T22:29:18Z

Thank you for the explanation, it makes a lot of sense.
I assumed too much about the str type and its purpose, and appreciate the clarification.
Perhaps there is room in the standard library for a text module of some sort, for doing more high-level work with text?

graydon · 2012-07-06T22:44:06Z

Definitely. Some machinery exists in core for handling basic tasks associated with strings in the various operating-system-required encodings; more will wind up in libstd, likely a binding to libicu.

pcwalton · 2013-05-09T17:54:00Z

Nominated for backwards compatible

graydon · 2013-06-06T17:30:54Z

accepted for backwards-compatible milestone

emberian · 2013-08-05T17:00:56Z

I agree that \xNN should be a code unit and \u... should be code point.

pnkfelix · 2013-09-26T10:58:05Z

cc me

SimonSapin · 2014-01-02T19:26:07Z

I disagree. I don’t see a reason to use code units in (Unicode) literal strings. Why would you want "\xEF\xBF\xBD" rather than "\uFFFD"? (Of course, byte string literals are a different story.)

If the concern is that "\xNN" looks like a byte while it’s not, we could only allow it for values in the ASCII range (\x00 to \x7F) i.e. for code points that are represented as one byte in UTF-8.

If \xNN is still changed to represent code units, literals that contain invalid UTF-8 like "\x80" should be compile-time errors so as to not break str’s promise to contain valid UTF-8.
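
The validity concern can be sketched with a run-time check. This is only an illustration in present-day Rust using `std::str::from_utf8`; it is not the compile-time check proposed above, and the helper name `decode` is hypothetical.

```rust
use std::str;

/// Return the bytes as a string slice if they are valid UTF-8, else `None`.
fn decode(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    // Three code units that together encode the single code point U+FFFD.
    println!("{:?}", decode(b"\xEF\xBF\xBD") == Some("\u{FFFD}"));

    // A lone continuation byte is not valid UTF-8, so it cannot become a str.
    println!("{:?}", decode(b"\x80"));

    // Values in the ASCII range are one byte either way, so both readings agree.
    println!("{:?}", decode(b"\x7F"));
}
```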

emberian · 2014-01-02T20:54:53Z

I agree with that, @SimonSapin. The reason I wanted it was for byte string literals, but at the time bytes!(...) didn't exist and IMO that allows for much nicer literals than \xNN style stuff. I no longer agree with this change.

SimonSapin · 2014-01-02T21:03:08Z

Yes, I also want byte literals (and found this when searching for that.)

brson · 2014-01-03T02:04:58Z

If \xFF indicates a code unit does that mean that character literals need to support multiple \x escapes? I can't tell yet if there's any precedent for that in other languages.

brson · 2014-01-03T02:09:03Z

I guess the strongest argument in favor of code units is Behdad's, that it would make our string escapes compatible with C and Python.

brson · 2014-01-03T02:10:11Z

Behdad's full argument:

Here: "\xHH, \uHHHH, \UHHHHHHHH Unicode escapes", I strongly suggest that
\xHH be modified to allow inputting direct UTF-8 bytes. For ASCII it doesn't
make any difference. For Latin1, it gives the impression that strings are
stored in Latin1, which is not the case. It would also make C / Python
escaped strings directly usable in Rust. Ie. '\xE2\x98\xBA' would be a single
character equivalent to '\u263a', not three Latin1 characters.

SimonSapin · 2014-01-03T09:42:31Z

I don’t know about C, but Behdad’s argument does not apply to Python. Python (both in 2.x and 3.x) has two types of strings: byte strings, where \xHH is a byte and \uHHHH is not an escape sequence; and Unicode strings where \xHH is a code point and u'\xE2\x98\xBA' is indeed three code points in the Latin1 range.
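
The two readings of `\xE2\x98\xBA` can be compared side by side. The sketch below is in present-day Rust; the function names `as_utf8` and `as_code_points` are hypothetical and stand in for the "code unit" and "code point" interpretations debated in this thread.

```rust
/// Treat each byte as a UTF-8 code unit (the reading Behdad proposes for \xHH).
fn as_utf8(bytes: &[u8]) -> Option<String> {
    String::from_utf8(bytes.to_vec()).ok()
}

/// Treat each byte as a code point in the Latin1 range
/// (what a Python Unicode string does with \xHH).
fn as_code_points(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    let bytes = [0xE2_u8, 0x98, 0xBA];

    // Code-unit reading: one character, U+263A.
    let unit_reading = as_utf8(&bytes);
    println!("{:?}", unit_reading);

    // Code-point reading: three characters, U+00E2, U+0098, U+00BA.
    let point_reading = as_code_points(&bytes);
    println!("{} characters", point_reading.chars().count());
}
```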

brson · 2014-01-18T07:34:32Z

There doesn't seem to be a definitive argument for either side, and since changing these to be code units makes their validation slightly harder, I'm inclined to just leave as-is and call it done.

pnkfelix · 2014-01-18T07:43:01Z

@SimonSapin indeed, I think graydon said the same thing in his initial response to Behdad, along with providing a more complete table of what different languages do here.

(Though that table is missing C# it seems.)

So is Rust going to be more like python and scheme, or more like perl, go, C, C++, ruby... ?

pnkfelix · 2014-01-18T07:43:21Z

(having said that, I'm fine with brson's suggestion to leave things as they are.)

brson · 2014-01-21T18:49:37Z

In today's meeting we decided to leave this as is.

graydon mentioned this issue Mar 5, 2013

(still broken) str: stop encoding invalid code points #5151

Closed

brson closed this as completed Jan 21, 2014

alexcrichton mentioned this issue Mar 9, 2014

Remove \xXX char escapes from the language #12769

Closed

SimonSapin mentioned this issue May 13, 2014

RFC: Add byte and byte string literals rust-lang/rfcs#69

Merged

pnkfelix mentioned this issue Sep 26, 2014

Remove \xXX char escapes from the language rust-lang/rfcs#312

Closed
