properly support malformed char literals #44765

simeonschaub · 2022-03-27T05:57:36Z

Fixes the parsing of char literals like '\xc0\x80'. At first, I tried
to replicate the behavior of getindex on a string in Julia here, but
then I noticed that we probably also want to support cases like
'\xff\xff', which would give two characters in a Julia string. Now
this also supports non-sensible UTF-8, but only if written out using
\x...

fixes #25072

Fixes the parsing of char literals like `'\xc0\x80'`. At first, I tried to replicate the behavior of `getindex` on a string in Julia here, but then I noticed that we probably also want to support cases like `'\xff\xff'`, which would give two characters in a Julia string. Now this supports any combination of characters as long as they are between 1 and 4 bytes, so even literals like `'abcd'` are allowed. I think this makes sense because otherwise we wouldn't be able to reparse every valid char in Julia, but I could see Python users being confused about why Julia only supports strings up to length 4... 😄 fixes #25072

Keno · 2022-03-27T17:40:55Z

'abcd'

I don't like that. I feel like for Char purposes, literal, non-escaped chars are basically zero-padded.

simeonschaub · 2022-03-27T17:48:05Z

Yeah, I get what you mean. Unfortunately, to allow this only for escaped char literals, we can't piggyback off the parsing for strings anymore, we'd have to have custom parsing rules for char literals.

Malformed chars now need to be written out explicitly with (multiple) `\x...`.

test/syntax.jl

StefanKarpinski · 2022-03-28T18:28:44Z

@simeonschaub, I want to thank you for fixing a bunch of syntaxy issues like this that have been outstanding for many years and make the language much smoother and more cohesive. Bravo 👏🏻

StefanKarpinski · 2022-03-28T19:06:43Z

Messing about with this I found an interesting corner case: '\xbf\xbf\xbf'. This PR currently allows this and handles it as:

'\xbf\xbf\xbf': Malformed UTF-8 (category Ma: Malformed, bad data)

While you can construct a Char object that contains this data, you can also create one that contains multiple valid characters, such as what @Keno objected to:

julia> c = reinterpret(Char, reverse(map(UInt8, ['a', 'b', 'c', 'd'])))[1]
'\x61\x62\x63\x64': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> println(c)
abcd

IMO allowing one but not the other is inconsistent. If we allow the one with invalid data why do we disallow the one with valid data? I think both should be disallowed. But then what is a sensible criterion for which character literals are allowed? The key issue with both of these is that iterating string data can never produce the character 'abcd' or '\xbf\xbf\xbf'. In other words, a simple approach here would be to parse the character literal as a string and then check if it's a single character (valid or invalid). In other words basically make '...' the equivalent of only("...").

simeonschaub · 2022-03-28T19:31:37Z

Messing about with this I found an interesting corner case: '\xbf\xbf\xbf'. This PR currently allows this and handles it as:
'\xbf\xbf\xbf': Malformed UTF-8 (category Ma: Malformed, bad data)

Yes, that is the intended behavior!

The key issue with both of these is that iterating string data can never produce the character 'abcd' or '\xbf\xbf\xbf'. In other words, a simple approach here would be to parse the character literal as a string and then check if it's a single character (valid or invalid). In other words basically make '...' the equivalent of only("...").

While that's true, I personally really like that with this PR, we can represent any possible Julia Char as a literal. In fact, we already assume this when printing these characters, so it does seem useful to me to have syntax for this.

I thought what Keno was saying is that something like 'abcd' to everyone looks like 4 characters, whereas for '\xbf\xbf\xbf', you already need a good understanding of UTF-8 to be able to tell whether that gives you one character. So I think at that point it isn't too confusing to extend the syntax to non-sensible UTF-8 in the case of character literals.

StefanKarpinski · 2022-03-28T21:21:35Z

I think it's worth stepping back and thinking about why we want the ability to express invalid Char values as literals. It's useful to be able to write invalid Char values like '\xff' or '\x80' or '\0xc0\x80' because you can encounter them when working with actual strings. You might, for example, read some data in that is UTF-8 with '\xff' as a field separator, in which case you could read lines and splits them on the '\xff' character, even though it's not a valid UTF-8 character. I could come up with more examples, but the point is that the goal is not to have a literal syntax for every possible memory that a Char could take in memory, it is to have literal syntax for Char values that might be useful to work with as characters. And that is precisely the ones that might be iterated by a string.

Moreover, there is an implicit invariant for the Char type, which is that it only ever holds data for a single character as defined by string iteration. You can violate that invariant and construct Char values like 'abcd' or \x80\x80' by using reinterpret, but we really don't want to encourage it and we certainly shouldn't provide syntax to construct Char values like that. The same invariant argument doesn't apply to characters like '\xff' or '\x80' or '\0xc0\x80' because there is already an official way to get these Char values without reinterpret, i.e. "\xff"[1], "\x80"[1] and "\0xc0\x80"[1]. All that this PR should do, IMO, is align the set of Char values that you can express with character liter and string literal syntax.

simeonschaub · 2022-03-28T22:29:58Z

I think I still disagree with this. I really mostly think of Char as a type on its own right, which represents a superset of values you might get from iterating a UTF-8 encoded string. Just because you can't get a certain Char from iterating a string doesn't mean it's not useful, they may be used as a sentinel value for example.

OTOH, I don't really see what we gain from declaring these Chars to be generally ill-defined. We'd also have to change their printing to something like reinterpret(Char, 0x80800000), which is a lot more verbose and IMO will read very awkwardly among other valid Chars in an array for example.

Maybe another way of thinking about it: to me \x80 just represents a byte. If you put those bytes into a string, that string will of course iterate a certain way, but I would say that's not really something fundamental about the \x80. The same way, you can also fill a Char/UInt32 with bytes -- you start with the top-most of the 4 bytes and fill the remaining bytes with zeros.

JeffBezanson · 2022-03-31T18:55:20Z

I completely agree with Stefan. Char is already sketchy-but-very-useful for representing invalid utf-8, we should not make it even more complex by allowing things that wouldn't even be possible in that (highly permissive) context. There are likely to be negative consequences down the road for adding more weird corner cases that Char code needs to support.

vtjnash · 2022-03-31T20:47:57Z

IIUC, Char already supports those, since its domain is strictly 0..2^32. That is pre-decided. The question posed here is whether the parser should support forming those characters directly, or require forming them via reinterpret. Relatedly, there is the question of whether to show those non-unicode characters using character syntax with '\xff\xff\xff\xff' (as we show it now) or require writing it as typemax(Char) == reinterpret(Char, typemax(UInt32)) (as we parse it now)

StefanKarpinski · 2022-04-02T21:06:00Z

I would be in favor of also showing those characters as reinterpret calls. I'd suggest erroring but that would slow a lot of code down (even if the error doesn't happen). Another option would be to ignore trailing junk in Char values and consider 'abcd' equivalent to 'a' but that would complicate equality checking etc.

JeffBezanson · 2022-04-14T19:30:59Z

I will try to implement this to work the way we want.

simeonschaub · 2022-04-15T05:53:36Z

I will try to implement this to work the way we want.

I think I already implemented most of that in a separate branch. Let me fix that up

Alternative to #44765. This disallows character literals that can not be created from iterating a UTF-8 string. fixes #25072

JeffBezanson · 2022-04-26T22:00:57Z

Great. Closing in favor of #44989

Make the syntax for character literals the same as what is allowed in single-character string literals. Alternative to #44765 fixes #25072

simeonschaub added parser Language parsing and surface syntax domain:unicode Related to unicode characters and encodings labels Mar 27, 2022

simeonschaub requested review from JeffBezanson and stevengj March 27, 2022 05:57

simeonschaub force-pushed the sds/char_literals branch from f20a3a7 to 6c955c6 Compare March 27, 2022 15:08

fix tests, give more useful test errors

83d9033

DilumAluthge requested a review from StefanKarpinski March 27, 2022 17:51

disallow multiple non-escaped chars

532596a

Malformed chars now need to be written out explicitly with (multiple) `\x...`.

simeonschaub commented Mar 27, 2022

View reviewed changes

test/syntax.jl Show resolved Hide resolved

simeonschaub added 2 commits March 27, 2022 16:54

fix octal escape codes

f9e4eac

some cleanup

ae12249

simeonschaub added the status:triage This should be discussed on a triage call label Mar 29, 2022

JeffBezanson removed the status:triage This should be discussed on a triage call label Apr 14, 2022

simeonschaub added a commit that referenced this pull request Apr 15, 2022

support malformed characters, the second

71683df

Alternative to #44765. This disallows character literals that can not be created from iterating a UTF-8 string. fixes #25072

simeonschaub mentioned this pull request Apr 15, 2022

support malformed char literals, the second #44989

Merged

JeffBezanson closed this Apr 26, 2022

DilumAluthge deleted the sds/char_literals branch April 26, 2022 22:04

simeonschaub restored the sds/char_literals branch April 27, 2022 08:07

giordano deleted the sds/char_literals branch May 13, 2022 22:42

JeffBezanson pushed a commit that referenced this pull request May 19, 2022

support malformed chars in char literal syntax

16e37c2

Make the syntax for character literals the same as what is allowed in single-character string literals. Alternative to #44765 fixes #25072

JeffBezanson pushed a commit that referenced this pull request May 24, 2022

support malformed chars in char literal syntax

a1ce793

Make the syntax for character literals the same as what is allowed in single-character string literals. Alternative to #44765 fixes #25072

KristofferC pushed a commit that referenced this pull request May 25, 2022

support malformed chars in char literal syntax (#44989)

991190f

Make the syntax for character literals the same as what is allowed in single-character string literals. Alternative to #44765 fixes #25072

simeonschaub mentioned this pull request Jan 15, 2023

incorrect search results for malformed Char #48283

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

properly support malformed char literals #44765

properly support malformed char literals #44765

simeonschaub commented Mar 27, 2022 •

edited

Loading

Keno commented Mar 27, 2022

simeonschaub commented Mar 27, 2022

StefanKarpinski commented Mar 28, 2022

StefanKarpinski commented Mar 28, 2022 •

edited

Loading

simeonschaub commented Mar 28, 2022 •

edited

Loading

StefanKarpinski commented Mar 28, 2022 •

edited

Loading

simeonschaub commented Mar 28, 2022 •

edited

Loading

JeffBezanson commented Mar 31, 2022

vtjnash commented Mar 31, 2022

StefanKarpinski commented Apr 2, 2022

JeffBezanson commented Apr 14, 2022

simeonschaub commented Apr 15, 2022

JeffBezanson commented Apr 26, 2022

properly support malformed char literals #44765

properly support malformed char literals #44765

Conversation

simeonschaub commented Mar 27, 2022 • edited Loading

Keno commented Mar 27, 2022

simeonschaub commented Mar 27, 2022

StefanKarpinski commented Mar 28, 2022

StefanKarpinski commented Mar 28, 2022 • edited Loading

simeonschaub commented Mar 28, 2022 • edited Loading

StefanKarpinski commented Mar 28, 2022 • edited Loading

simeonschaub commented Mar 28, 2022 • edited Loading

JeffBezanson commented Mar 31, 2022

vtjnash commented Mar 31, 2022

StefanKarpinski commented Apr 2, 2022

JeffBezanson commented Apr 14, 2022

simeonschaub commented Apr 15, 2022

JeffBezanson commented Apr 26, 2022

simeonschaub commented Mar 27, 2022 •

edited

Loading

StefanKarpinski commented Mar 28, 2022 •

edited

Loading

simeonschaub commented Mar 28, 2022 •

edited

Loading

StefanKarpinski commented Mar 28, 2022 •

edited

Loading

simeonschaub commented Mar 28, 2022 •

edited

Loading