RFC: Add byte and byte string literals #69

SimonSapin · 2014-05-05T23:53:09Z

No description provided.

SimonSapin · 2014-05-05T23:53:21Z

Rendered view: https://github.com/SimonSapin/rfcs/blob/ascii-literals/active/0000-ascii-literals.md

SimonSapin · 2014-05-05T23:56:23Z

Previously: rust-lang/rust#4334

Apparently, GitHub’s auto-linking does not apply when rendering in-repo Markdown files.

nrc · 2014-05-06T00:02:51Z

active/0000-ascii-literals.md

+byte string literals of type `&'static [u8]` (or `[u8]`, post-DST).
+They are identical to the existing character and string literals, except that:
+
+* They are prefixed with a `b` (for "binary"), to distinguish them


-1 for b as a prefix - I don't see anything more or less binary about these chars/strs than regular ones

b is taken from Python, but I’m not especially attached to it. I’d be fine with another syntax. How about one of these? a'\t' (a for ASCII), '\t'u8 (the latter doesn’t really work for strings, though)

I prefer the last one. Why wouldn't it work for strings?

chris-morgan · 2014-05-06T00:14:25Z

+1 all round.

+1 for raw strings, though I would use br"" rather than rb"" for consistency with Python. (The arbitrary decision is made in Python that the order is br and not rb; similarly for Unicode string literals, ur and not ru.)

+1 for removing bytes!. It’s become fairly useless anyway with that 'static lifetime issue.

@nick29581 with raw strings having come since the discussion in rust-lang/rust#4334, b"…" is now more consistent rather than less as it was at the time. For 't'u8 vs. b't', there’s still precedent either way.

nrc · 2014-05-06T00:20:47Z

I didn't know we had support for raw strings, so I feel a bit better about a b prefix now.
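
For readers unfamiliar with the proposed syntax, here is a minimal sketch of the two literal forms under discussion, written for current Rust. The names `TAB`, `GREETING`, and `is_get_request` are made up for illustration and do not come from the RFC. Note that in current Rust a byte string literal has type `&'static [u8; N]` and coerces to the `&'static [u8]` the RFC text mentions.

```rust
/// A byte literal is a plain `u8`; here, the ASCII tab character (value 9).
pub const TAB: u8 = b'\t';

/// A byte string literal, coerced to a byte slice as the RFC text describes.
/// (The literal itself is a `&'static [u8; N]` in current Rust.)
pub const GREETING: &'static [u8] = b"GET / HTTP/1.1\r\n";

/// Hypothetical helper: returns true when `line` starts with the ASCII bytes "GET ".
pub fn is_get_request(line: &[u8]) -> bool {
    line.starts_with(b"GET ")
}
```

The `b` prefix is the only visible difference from a `char` or `&str` literal; the escapes (`\t`, `\r`, `\n`) read the same way.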

chris-morgan · 2014-05-06T00:50:48Z

active/0000-ascii-literals.md

+# Unresolved questions
+
+Should there be "raw byte string" literals?
+E.g. `pdf_file.write(rb"<< /Title (FizzBuzz \(Part one\)) >>")`


Python precedent is for allowing br and forbidding rb (syntax error). Also: yes.

Aatch · 2014-05-06T01:11:45Z

I strongly support this RFC. I was actually planning on writing almost exactly the same RFC myself, so thanks @SimonSapin.

Valloric · 2014-05-06T03:40:08Z

Very strong +1. Every day I spend writing Rust I wish it had byte string literals.

jsanders · 2014-05-06T15:14:58Z

👍 This seems really nice, regardless of the specific syntax it ends up being.

bstrie · 2014-05-06T19:26:42Z

I too was going to argue about syntax, but the precedent from Python is good enough for me. +1 on all fronts.

An extra +1 to enforcing br"foo" and disallowing rb"foo". This also makes raw strings look nicer in their extended form: br###"foo"### rather than r###b"foo"###. Please include this in the RFC.
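
As a sketch of the `br` form argued for above, assuming current Rust (which accepts `br"…"` but not `rb"…"`), the PDF fragment from the RFC's example can be written with and without escapes. The constant names are illustrative only.

```rust
/// The PDF fragment from the RFC's example as a raw byte string:
/// backslashes are kept as literal bytes, so they need no doubling.
pub const PDF_TITLE_RAW: &[u8] = br"<< /Title (FizzBuzz \(Part one\)) >>";

/// The same bytes as a non-raw byte string, where each backslash is escaped.
pub const PDF_TITLE_ESCAPED: &[u8] = b"<< /Title (FizzBuzz \\(Part one\\)) >>";

/// The extended form with `#` delimiters allows a double quote inside the literal.
pub const QUOTED: &[u8] = br#"say "hi""#;
```

The extended form puts the prefix first and the `#` delimiters after it, which is the `br###"foo"###` shape preferred in the comment above.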

ben0x539 · 2014-05-06T23:25:05Z

Do we really want to 'overload' \x for this? Can we use another escape sequence? If so, we could allow \x, \u and \U in byte strings...

edwardw · 2014-05-09T20:13:13Z

Any chance to borrow some binary pattern matching stuff from Erlang? I find it very powerful and pleasant to use at the same time, e.g. Erlang bit syntax.

SimonSapin · 2014-05-10T00:10:36Z

@edwardw That sounds like a separate RFC. Maybe #29?

SimonSapin · 2014-05-10T00:12:35Z

@ben0x539

Do we really want to 'overload' \x for this?

I do. It follows the precedent of other languages of \x meaning one byte in a byte context.

If so, we could allow \x, \u and \U in byte strings...

It was deliberate to exclude \u and \U in this RFC. What would they even mean?

ben0x539 · 2014-05-11T13:36:28Z

@SimonSapin

I do. It follows the precedent of other languages of \x meaning one byte in a byte context.

Yeah, but it means \x means rather different things in either flavor of string literal. It doesn't follow the precedent of this very same language. :(

It was deliberate to exclude \u and \U in this RFC. What would they even mean?

What they mean in regular string literals. I mean, really byte string literals are just regular string literals without the UTF-8 invariant and hence a different type, the syntax doesn't need to be completely different.

SimonSapin · 2014-05-11T17:16:37Z

Yeah, but it means \x means rather different things in either flavor of string literal.

I don’t see a problem here. This difference is precisely what makes byte literals different from Unicode literals in the first place…

What they mean in regular string literals

Meaning "Just assume UTF-8". I’m opposed to this. The point of working with bytes rather than Unicode is that you don’t necessarily know the encoding (other than it’s ASCII-compatible), so assuming a particular encoding is not appropriate. It could cause mojibake or other related bugs.

I suppose we have a different vision of what str is. You seem to think of it as a byte string that just happens to hold an invariant of being a valid UTF-8 sequence. I think of it as a sequence of Unicode scalar values (roughly: code points) that just happen to be represented in memory as UTF-8 bytes.
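
The two views of `str` can be made concrete with a small sketch in current Rust. The helper names `scalar_count` and `utf8_bytes` are hypothetical. The code point U+00FF is one scalar value but two bytes in UTF-8, whereas `\xFF` in a byte string is exactly one byte.

```rust
/// U+00FF (LATIN SMALL LETTER Y WITH DIAERESIS) written with a Unicode escape.
pub const Y_DIAERESIS: &str = "\u{FF}";

/// A byte string holding the single byte 0xFF, with no encoding implied.
pub const BYTE_FF: &[u8] = b"\xFF";

/// Number of Unicode scalar values in `s` (the "sequence of code points" view).
pub fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

/// The in-memory UTF-8 representation of `s` (the "byte string with an invariant" view).
pub fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}
```

Here `scalar_count(Y_DIAERESIS)` is 1 while `utf8_bytes(Y_DIAERESIS)` is the two bytes `0xC3, 0xBF`, which is the mismatch the two commenters are weighing.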

pcwalton · 2014-05-13T18:14:45Z

I like this as a potential solution for paths.

Valloric · 2014-05-13T19:39:41Z

\x should be removed in non-byte string literals. I've complained about this before in rust-lang/rust#12769 but now it makes even more sense since it will work as intended in byte string literals and will be actively harmful in utf8 string literals. The only thing \x will do there is confuse users and produce bugs since people will adapt algorithms from C, C++ etc and then forget to add the b prefix.

Removing \x from utf8 literals would prevent a whole series of possible bugs without removing a single shred of functionality because \xXX in utf8 literals is the same as \u00XX.

SimonSapin · 2014-05-13T19:40:27Z

@pcwalton what about paths? Filenames on Unix are fundamentally bytes that should only be interpreted in some encoding (nowadays often UTF-8, but not always, if you have an external hard drive from 1995). But on Windows they’re UTF-16. (Or maybe UCS-2.) I don’t see how byte literals would help std::path.

SimonSapin · 2014-05-13T19:41:43Z

\x should be removed in non-byte string literals

How about restricting it (for Unicode literals) to the ASCII range, where it maps to a single UTF-8 byte?

Valloric · 2014-05-13T19:46:38Z

How about restricting it (for Unicode literals) to the ASCII range, where it maps to a single UTF-8 byte?

I can live with that.

My main concern is people writing something like \xFF in 20 different languages and getting one thing and then writing that in Rust and getting another. I've personally been bitten by this bug.

So if we can restrict the range allowed by \x in utf8 literals to produce byte values that the same \x sequence would produce in a byte string, that's fine. We should then also consider adding a nice compiler error message saying something like "prefix your literal with b to make it a byte string" when the user uses \x outside the allowed range in utf8 literals.

Valloric · 2014-05-13T19:48:07Z

How about restricting it (for Unicode literals) to the ASCII range, where it maps to a single UTF-8 byte?

One thing though... what purpose would that serve? If we restrict it to the ASCII range, you might as well write a instead of \x61.

SimonSapin · 2014-05-13T19:53:30Z

One thing though... what purpose would that serve?

Same as removing it: avoid the debate of rust-lang/rust#2800

If we restrict it to the ASCII range, you might as well write a instead of \x61.

Yeah of course. But you may still want some of the "non-printable" code points of the ASCII range: U+0000 to U+001F and U+007F.
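
A sketch of how the ASCII-range restriction plays out, assuming current Rust, where `\x` escapes in `str` literals are limited to `\x00` through `\x7F` and higher values are accepted only in byte strings. The names below are illustrative.

```rust
/// A non-printable ASCII code point (U+001F) written with `\x` in a `str` literal.
pub const UNIT_SEPARATOR: &str = "\x1F";

/// U+007F (DELETE), the highest value `\x` can express in a `str` literal.
pub const DELETE: &str = "\x7F";

/// Values above 0x7F need the `b` prefix; writing "\xFF" without it is rejected
/// at compile time in current Rust.
pub const HIGH_BYTE: &[u8] = b"\xFF";

/// Hypothetical check: does `s`, viewed as UTF-8, consist of exactly the bytes `b`?
pub fn same_bytes(s: &str, b: &[u8]) -> bool {
    s.as_bytes() == b
}
```

Within the ASCII range the two literal kinds agree byte for byte, so `\x` never produces a different result depending on the prefix.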

brson · 2014-06-04T19:33:53Z

Thank you for the contribution. Accepted as RFC 23, per https://github.com/mozilla/rust/wiki/Meeting-weekly-2014-06-03. cc rust-lang/rust#14646

SimonSapin · 2014-06-13T20:48:24Z

For the record, I realized while implementing this that the combination of decisions in this RFC has two consequences I did not anticipate:

Byte literals can be used anywhere u8 can, even if it looks nonsensical. E.g.

    assert_eq!([42, ..b'\t'].as_slice(), &[42, 42, 42, 42, 42, 42, 42, 42, 42]);

Since unescaped characters in byte strings are limited to ASCII and raw byte strings do not have escape, it is not possible to write a raw byte string containing non-ASCII bytes.
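
The first consequence can be reproduced in current Rust, where the repeat expression is spelled `[value; N]` rather than `[value, ..N]`. This is a sketch with hypothetical function names; `b'\t'` has the value 9.

```rust
/// An array whose length is given by a byte literal: nine copies of 42.
pub fn nine_forty_twos() -> [u8; b'\t' as usize] {
    [42; b'\t' as usize]
}

/// Byte literals also take part in ordinary `u8` arithmetic.
pub fn letter_after_a() -> u8 {
    b'a' + 1
}
```

The second consequence has no runnable counterpart: a raw byte string has no escapes and unescaped characters must be ASCII, so a byte such as 0xFF has to go in a non-raw literal like `b"\xFF"`.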

Valloric · 2014-06-13T22:12:09Z

Byte literals can be used anywhere u8 can, even if it looks nonsensical

Doesn't sound like a big deal.

Since unescaped characters in byte strings are limited to ASCII and raw byte strings do not have escape, it is not possible to write a raw byte string containing non-ASCII bytes.

Seems reasonable enough; the limitation is there only for raw byte strings, not plain byte strings. And if you want to put unicode chars in a byte string, you are using escapes either way, so there obviously isn't any sensible reason to want to use a raw byte string.

In other words, the user can't have it both ways; you can't say "I want \uABCD to work but \t to be left alone." The compiler can't read their mind.

ben0x539 · 2014-06-13T23:44:36Z

with the bytes!() macro you could at least switch between raw and cooked strings and plain u8 numbers etc and it'd all get mashed together...

SimonSapin · 2014-06-13T23:48:14Z

@ben0x539 The plan is to remove bytes!(), since it’s redundant with byte string literals.

ben0x539 · 2014-06-14T00:17:50Z

Yeah, what I'm saying is that it isn't entirely because it lets you combine differently typed things into a single block of bytes.

See #14646 (tracking issue) and rust-lang/rfcs#69. This does not close the tracking issue, as the `bytes!()` macro still needs to be removed. It will be later, after a snapshot is made with the changes in this PR, so that the new syntax can be used when bootstrapping the compiler.

netvl · 2014-07-01T21:46:15Z

I noticed that the current Rust nightly shows a deprecation warning on bytes!() macro usage. But how do I write this literal construction without giving up readability?

    static DATA: &'static [u8] = bytes!(
        0, 0, 0, 0, 0, 0, 0, 3,  // # of paths
        0, 8, "/a/b/c/d",
        0, 0,   // theoretically possible
        0, 1, "/"
    );

With byte string literals it will look like this:

    static DATA: &'static [u8] = b"\0\0\0\0\0\0\0\x03\0\x08/a/b/c/d\0\0\0\x01/";

It looks awful compared to the bytes!() variant. I think that the bytes!() macro should be kept to allow things like these.

emberian · 2014-07-01T22:07:25Z

Yeah, I'm not quite convinced that we should remove bytes!() either.

SimonSapin · 2014-07-01T22:41:05Z

@netvl Like in Unicode strings, you can use "escaped newlines", which resolve to nothing: (note the backslashes at the very end of lines.)

    static DATA: &'static [u8] = b"\0\0\0\0\0\0\0\x03\
                                   \0\x08/a/b/c/d\
                                   \0\0\
                                   \0\x01/";

This is only half of what you asked for, in that you can’t have comments in the middle of a literal, but I have to say this looks very unusual. Also, out of context I have no idea what this data represents, so I don’t know what syntax makes sense to you.

Perhaps we could use Python’s idea that consecutive (byte) string literals are concatenated:

    static DATA: &'static [u8] = b"\0\0\0\0\0\0\0\x03"  // # of paths
                                 b"\0\x08/a/b/c/d"
                                 b"\0\0"  // theoretically possible
                                 b"\0\x01/";

Removing bytes!() was one of the "Unresolved questions" since the first revision of this RFC, and the only feedback I got was "Yes". This RFC has since been accepted and implemented. I suggest filing separate issues or RFCs if you want further changes.
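
To check that the escaped-newline form really yields the same bytes, here is a small sketch for current Rust comparing it with the single-line literal. The constant names are illustrative. A backslash at the end of a line drops the newline and the leading whitespace of the next line.

```rust
/// The data written as one long byte string literal.
pub const ONE_LINE: &[u8] = b"\0\0\0\0\0\0\0\x03\0\x08/a/b/c/d\0\0\0\x01/";

/// The same data split across lines with escaped newlines.
pub const MULTI_LINE: &[u8] = b"\0\0\0\0\0\0\0\x03\
                                \0\x08/a/b/c/d\
                                \0\0\
                                \0\x01/";
```

Both constants hold the same 23 bytes. The Python-style juxtaposition of adjacent literals shown above is a proposal only; as far as I know it is not accepted by current Rust.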

netvl · 2014-07-02T15:59:05Z

Escaped newlines certainly make things better, but the bytes!() version is still far more readable :(

I'm not sure whether a suggestion to keep bytes!() should be filed as an RFC. As for creating an issue, do you mean rfcs repo or rust itself? As far as I remember, there is no defined process for such things yet.

SimonSapin · 2014-07-03T07:50:40Z

I meant RFCs on this repo or issues on the rust repo. I don’t know which is more appropriate in this case. Maybe chat on IRC with one of the core team to see what they prefer.

To recap:

bytes!() was originally created to express &'static [u8] values based on text rather than numbers.
Byte string literals solve the same problem, but IMO better.
Since they’re redundant, I believe the original, more hackish solution should be removed.
The only case I know of where bytes!() is better is when you want to have non-significant whitespace and comments between parts of the same &'static [u8] value. The other aspect of bytes!() (converting various data types to [u8]) is not so important here.
I can think of a number of ways to address this use case:
- Keep bytes!(). I think this is overkill.
- Python-style concatenated literals that I mentioned above. (Language change.)
- Add a new concat_bytes!() macro (or named something else) similar to concat!(), but that only takes and returns &'static [u8] literals.
- Change the existing concat!() macro to return &'static [u8] when all its arguments are &'static [u8], instead of always returning &'static str.

But as I said, this RFC is done as far as I’m concerned. It’ll be up to you to champion something else through the process. I think anything based on macros is more likely to get accepted than a language change.
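
As a rough illustration of the use case in the list above (whitespace and comments between parts of one byte sequence), the parts can be joined at run time with the standard slice `concat` method. This is only a sketch: `build_data` is a hypothetical name, the result is an owned `Vec<u8>` rather than a `&'static [u8]`, and it is not the compile-time macro the list proposes.

```rust
/// Hypothetical helper: assembles the example data from commented parts.
/// The parts are concatenated at run time, so the result is a `Vec<u8>`.
pub fn build_data() -> Vec<u8> {
    let parts: [&[u8]; 4] = [
        b"\0\0\0\0\0\0\0\x03", // number of paths
        b"\0\x08/a/b/c/d",
        b"\0\0", // theoretically possible
        b"\0\x01/",
    ];
    parts.concat()
}
```

A macro-based solution would do the same joining at compile time and so could still initialize a static item.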

Valloric · 2014-07-05T19:00:24Z

There's one use case for bytes!() that I think isn't addressed by the current byte literals support and that's writing non-ASCII chars as a &'static [u8] UTF-8 string. For instance, I have the following in my codebase: bytes!( "葉" ) and I don't see a way to write that as a byte literal while keeping the original character in the code. This becomes an even bigger problem when you have something like bytes!( "ελληνικά" ). Escaping all the chars by hand makes it extremely hard to comprehend the original text.

SimonSapin · 2014-07-05T19:04:01Z

If it’s not in a "static" context, you can use "ελληνικά".as_bytes().

May I ask why &[u8] is preferred over &str for text that is known to be UTF-8?

Valloric · 2014-07-05T19:14:17Z

If it’s not in a "static" context, you can use "ελληνικά".as_bytes().

But I need it in a static context.

May I ask why &[u8] is preferred over &str for text that is known to be UTF-8?

Precisely because the text isn't known to be UTF-8. Many existing APIs just manipulate byte sequences without caring what's in them. Sometimes those APIs will get ASCII data, sometimes UTF-8, sometimes binary data. An existing networking API for example would be wrapped as accepting a &[u8], not a &str.

My current use case involves talking to an API that takes &[u8], and I have macros and other test code that needs a 'static lifetime. But that's beside the point; Rust needs a way to cover the general use-case of "non-ASCII text in the source code as a sequence of UTF-8 bytes with static lifetime".

There needs to be some way to handle that, otherwise there's a hole. You can get non-ASCII text as &[u8] but without the static bound, and you can get non-ASCII text as 'static or not but the type is &str. We have 3 out of 4 instead of 4 out of 4.

SimonSapin · 2014-07-05T19:22:53Z

I believe that "ελληνικά".as_bytes() also has a static lifetime. By static context I meant in the initializer for a static item rather than a static lifetime. This currently doesn’t work because there is no compile-time evaluation of functions.

That said, I won’t try to block the un-deprecation of bytes!() more than expressing the opinions I already have. But it’s not me you need to convince at this point. I suggest making a new RFC.
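
For later readers: the limitation described above applied to Rust at the time of this thread. To my knowledge, `str::as_bytes` has since become a `const fn`, so in current Rust the "fourth case" can be sketched as follows. The static names are illustrative.

```rust
/// Non-ASCII source text as UTF-8 bytes, in the initializer of a static item.
/// This relies on `str::as_bytes` being usable in constant evaluation.
pub static LEAF: &'static [u8] = "葉".as_bytes();

/// The longer example from the discussion, kept readable in the source.
pub static GREEK: &'static [u8] = "ελληνικά".as_bytes();
```

`LEAF` holds the three UTF-8 bytes of U+8449 (`0xE8, 0x91, 0x89`), the same bytes one would otherwise have to spell out as `b"\xE8\x91\x89"`.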

brson · 2014-07-09T00:15:53Z

@SimonSapin Is there a followup we need to do here to fix the issues you identified?

RFC: Add byte and byte string literals

f31895d

SimonSapin mentioned this pull request May 5, 2014

bytes!() should encode to ASCII instead of UTF-8 rust-lang/rust#13955

Closed

(Byte literals RFC) Fix lack of Markdown magic

4ea0ec9

Apparently, GitHub’s auto-linking does not apply when rendering in-repo Markdown files.

nrc reviewed May 6, 2014

(Byte literals RFC) Raw string prefix precedent

471fbe8

huonw mentioned this pull request May 6, 2014

consider including byte literals (alternative to integer literals for [u8] and u8) rust-lang/rust#4334

Closed

chris-morgan reviewed May 6, 2014

lilyball mentioned this pull request May 20, 2014

Remove \xXX char escapes from the language rust-lang/rust#12769

Closed

brson mentioned this pull request Jun 4, 2014

Tracking issue for RFC 23 - byte string literals rust-lang/rust#14646

Closed

brson merged commit 471fbe8 into rust-lang:master Jun 4, 2014

SimonSapin deleted the ascii-literals branch June 4, 2014 23:18

SimonSapin mentioned this pull request Jun 13, 2014

Add byte, byte string, and raw byte string literals. rust-lang/rust#14880

Closed

pnkfelix mentioned this pull request Sep 26, 2014

Remove \xXX char escapes from the language #312

Closed

chriskrycho mentioned this pull request Feb 8, 2017

Document all features in the reference rust-lang/rust#38643

Closed

17 tasks

chriskrycho mentioned this pull request Mar 11, 2017

Document all features rust-lang/reference#9

Closed

48 tasks

Centril added A-syntax Syntax related proposals & ideas A-expressions Term language related proposals & ideas A-string Proposals relating to strings. labels Nov 23, 2018
