Relax string literal rules #29

Open
ghost opened this issue Oct 4, 2016 · 25 comments

Comments

@ghost

ghost commented Oct 4, 2016

The current rules of Standard ML forbid the direct inclusion of any characters outside of ASCII within a string literal. I propose that the rules be relaxed in Successor ML so that conforming implementations may allow string literals to contain characters outside of ASCII for character encodings that are ASCII-compatible, such as single-byte encodings like ISO-8859-1 and multi-byte encodings like UTF-8. This would probably also necessitate changes in the Standard Basis' concept of strings, perhaps recording the exact encoding used in a particular string. Thoughts?
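For concreteness, an added sketch (not part of the original report): under the current rules a non-ASCII string must be spelled with escape sequences, whereas the relaxed rule would let the bytes appear directly whenever the source file uses an ASCII-compatible encoding such as UTF-8.

(* status quo: é has to be written as its byte escapes *)
val s1 = "caf\195\169"        (* the UTF-8 bytes 0xC3 0xA9 *)

(* with the relaxed rule, in a UTF-8 source file: *)
val s2 = "café"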

@JohnReppy
Contributor

I've mostly written a draft proposal for allowing strings and comments to include UTF-8 characters. I got stuck on the issue of what code points to allow in strings, since one wants the rendering of a literal to reflect its content (e.g., we don't allow tabs in strings, since they cannot be distinguished from spaces). I don't think that you can support both ISO-8859-1 and UTF-8, since that would introduce ambiguity in rendering.
BTW, SML/NJ already allows non-ASCII characters in strings.

@igstan

igstan commented Oct 4, 2016

@JohnReppy what kind of ambiguity are you referring to with regard to mixing ISO-8859-1 and UTF-8?

@JohnReppy
Contributor

JohnReppy commented Oct 4, 2016

If we see the byte sequence 0xC2 0xA9, should we render that as the two-character ISO-8859-1 string "Â©" or the single UTF-8 character "©"?
I think that UTF-8 has won the character encoding war and that it is the best choice for encoding multibyte characters in 8-bit character strings.
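A small illustration of the two readings, using the Basis Word8Vector and Byte structures (a sketch added here, not from the original comment):

val bytes = Word8Vector.fromList [0wxC2, 0wxA9]
val raw   = Byte.bytesToString bytes
(* interpreted as ISO-8859-1: the two characters "Â©"            *)
(* interpreted as UTF-8:      the single code point U+00A9 ("©") *)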

@ghost
Author

ghost commented Oct 4, 2016

True, it has. I have nothing against focusing exclusively on UTF-8 support, but I was hoping there would at least be a means of converting strings from legacy encodings like ISO-8859-1. How would we handle chars and strings going forward, then? We'd either need to redefine string and char as they exist, or invent new types just for Unicode. Hm.

@MatthewFluet
Contributor

With regard to which code points to allow in UTF-8 string literals, doesn't any sort of isPrint or isWhiteSpace function for Unicode basically require loading in a whole Unicode description table? That seems rather heavyweight (without also committing to comprehensive Unicode support). There would also be other issues, like whether a string literal should be allowed to start with a combining character (without the character it modifies).

@ghost
Author

ghost commented Oct 6, 2016

Having consulted the libutf8proc library, that does appear to be the case. It does use large lookup tables, as seen here: http://git.netsurf-browser.org/libutf8proc.git/tree/src/utf8proc_data.c

Is this a significant problem? I put forth this proposal because I think Successor ML needs to gradually acquire at least some basic Unicode support, not necessarily complete support. I was hoping for at least the ability to handle non-English strings without having to resort to escape sequences. It just seemed like a good way to give Successor ML more relevance to the present.

@JohnReppy
Contributor

I think that it is a cost that we will have to pay. Given that the MLton compiler is over 30 MB and SML/NJ is almost 11 MB, I don't think that table size will be a show stopper.

@ratmice
Contributor

ratmice commented Oct 6, 2016

The size of the compiler is one thing, but Matthew's example seems to indicate this would have to be included in the size of the runtime, which for me at least makes it a tougher pill to swallow than if it just increased compiler sizes.

@JohnReppy
Contributor

Only if you are using the Unicode features. There is no reason that the table would have to be included for all executables.

@ghost
Author

ghost commented Oct 6, 2016

@JohnReppy, to ensure that, perhaps it would be best to retain the old ASCII string implementation (mainly for compatibility reasons) and add a new API for Unicode "chars" and "strings". That is how it has been handled in C, for the most part.
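Purely as a hypothetical sketch of what such a separate API might look like (none of these names exist in the Basis Library or in any proposal):

signature UTF8_STRING = sig
  type t                                (* UTF-8 encoded text *)
  val fromString : string -> t option   (* validate and wrap a byte string *)
  val toString   : t -> string          (* the underlying bytes *)
  val size       : t -> int             (* number of code points, not bytes *)
  val sub        : t * int -> word      (* i-th code point *)
end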

@JohnReppy
Contributor

JohnReppy commented Oct 6, 2016

There are several distinct issues here. First, there is the representation of the Char and String types. The SML Basis Library does allow for wide implementations of these, but I don't know if any existing implementation of SML supports WideText. There is also the question of allowing UTF-8 multibyte characters in string literals. The presence of such characters does not require any runtime support, but it does require that the compiler accept them (and possibly check them for validity). Lastly, there is the question of supporting Unicode property checking (on either UTF-8 characters or wide characters) and translation between encodings. That is where the Unicode property tables are needed. This third case will require defining some new Basis Library APIs.

In the near term, I am most interested in the support for UTF-8 multibyte sequences being allowed in string literals (and in comments). I've done some work on a proposal writeup, but I got stuck on specifying what would be allowed as a valid sequence in a string literal.

@JohnReppy
Contributor

BTW, I just compiled the utf8proc code linked to above. The resulting text-segment (TEXT) size is under 300 KB, so it is big, but not ridiculously so (<1% of the size of the MLton compiler executable on the same machine).

@ghost
Author

ghost commented Oct 6, 2016

I could see that being a problem for some devices, but the kinds of devices SML runs on usually have orders of magnitude more RAM than that. So I'm not really sure what point is being made about the size of the lookup tables.

@MatthewFluet
Contributor

@JohnReppy I agree with your three distinct issues. MLton does have very basic support for Char16/String16 and Char32/String32, which are used to implement a structure WideChar : CHAR, but the meaning of a lot of the operations is murky.

I agree that Char and String should continue to be 8-bit elements and sequences of (any) 8-bit elements.

My main comment was about how to support the second issue (UTF-8 multibyte characters in string literals) without also dealing with the third issue (comprehensive Unicode APIs). Any non-trivial specification of valid UTF-8 sequences in a string literal would seem to be driven by Unicode properties, and so would need (at least a basic form of) those APIs. By contrast, SML/NJ's current support (i.e., just accept any character with code >= 128), or a slightly different but related approach (i.e., accept any proper 2-, 3-, or 4-byte UTF-8-like sequence, without checking for complete UTF-8 validity; see MLton/mlton#117 (comment)), needs neither. This "weak" support seems to cover the most common use case (see a glyph somewhere else and cut-and-paste it into a string literal) without being overly complicated.
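A minimal sketch of that "weak" check (code added here, not MLton's or SML/NJ's actual lexer): accept ASCII bytes, plus any 2-, 3-, or 4-byte lead byte followed by the right number of continuation bytes, with no overlong, surrogate, or range checking.

fun acceptsUTF8Like (s : string) : bool =
      let
        val n = size s
        fun byte i = Char.ord (String.sub (s, i))
        fun isCont i = i < n andalso byte i div 64 = 2         (* 10xxxxxx *)
        fun go i =
              i >= n orelse
              let val b = byte i in
                if b < 0x80 then go (i + 1)                    (* ASCII byte *)
                else if b div 32 = 6 then                      (* 110xxxxx: 2-byte sequence *)
                  isCont (i + 1) andalso go (i + 2)
                else if b div 16 = 14 then                     (* 1110xxxx: 3-byte sequence *)
                  isCont (i + 1) andalso isCont (i + 2) andalso go (i + 3)
                else if b div 8 = 30 then                      (* 11110xxx: 4-byte sequence *)
                  isCont (i + 1) andalso isCont (i + 2) andalso isCont (i + 3)
                  andalso go (i + 4)
                else false                                     (* stray continuation or invalid lead byte *)
              end
      in
        go 0
      end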

Also, with regard to Unicode tables, my concern isn't so much the size of the tables, but rather that they are another external dependency that would need to be added to the compiler (even if not exported through Basis Library APIs).

@elijahdorman

In an international world, first-class support for all languages is a must-have.

Why not extend the syntax and keep current strings as-is (great for backward compatibility)? I think JS in particular offers a nice example (syntax-wise): use backticks to delimit UTF-8 literals, make them multi-line by default, and allow inline interpolation expressions. Indexed access returns either the byte array of a grapheme or a grapheme record (containing the related/non-char codes, byteLength, baseChar, etc.). The developer may then choose whether ASCII or UTF-8 is correct for a given project.
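Purely as an illustration of that idea, some hypothetical surface syntax (not part of SML or of any existing Successor ML proposal):

val name = "café"
val msg  = `Bonjour, ${name} !
            backtick literals would be UTF-8 and multi-line by default`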

@JohnReppy
Contributor

I think that we can extend the current string literal syntax to allow UTF-8 literals in strings (i.e., I do not think that we need new literal syntax), but we need to make sure that string literals that appear equal are, in fact, equal. For example, try the following experiment in SML/NJ's REPL:

- print "caf\u00c3\u00a9\n";
café
- print "cafe\u00cc\u0081\n";
café

In this case, the first string is normalized, whereas the second is not. I think that we should require that the contents of a string literal be normalized UTF-8; to define unnormalized string values as literals would require using escape sequences (just as including tabs in a string requires using \t).
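Continuing that experiment in SML/NJ's REPL (an added illustration, assuming the default 8-bit string type), the two literals above are different strings even though they render identically:

- "caf\u00c3\u00a9" = "cafe\u00cc\u0081";
val it = false : bool
- size "caf\u00c3\u00a9";
val it = 5 : int
- size "cafe\u00cc\u0081";
val it = 6 : int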

@rossberg-old

@JohnReppy, is there much benefit to checking normalisation? Making that a requirement introduces a fairly deep dependency on Unicode, which means that all compilers would need to pull in a Unicode library. It also means that, technically, the language potentially changes with each update to Unicode, which seems rather undesirable.

It seems to me that it would be good enough if compilers merely checked that the literal is a valid UTF-8 encoding, which requires no actual interpretation of code points and should be stable.

@rossberg-old

rossberg-old commented May 5, 2017

However, I actually think that it is backwards to define the lexical grammar itself in terms of an encoding. To support Unicode properly, the grammar needs to be specified in terms of code points (which is a very simple change to the Definition). That gives you Unicode in comments for free. The requirement for string literals is that they may not contain code points outside the respective range of the resolved string type. Encoding issues would be outside the language spec, so that you can convert source code from one encoding to another.

As an extension (and that is a separate feature), you could then specify that the default char type actually means "byte", and that literals for the default string may contain larger code points that get auto-encoded into UTF-8.

Reinterpreting char as byte is rather hacky; the more purist but cumbersome solution would be to require the use of Int8Vector for UTF-8 "strings" -- that makes clear that such a value does not consist of characters, and that applying char functions to its elements makes no sense. (Of course, when designing a language from scratch, the right approach is to not make string a vector of chars in the first place.)
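As an illustrative sketch of that auto-encoding step (code added here, not from the original comment): converting a code point to its UTF-8 byte sequence needs no Unicode tables at all.

fun utf8Encode (cp : int) : string =
      let
        fun byte w = String.str (Char.chr w)
      in
        if cp < 0x80 then byte cp
        else if cp < 0x800 then
          byte (0xC0 + cp div 64) ^ byte (0x80 + cp mod 64)
        else if cp < 0x10000 then
          byte (0xE0 + cp div 4096) ^ byte (0x80 + cp div 64 mod 64)
          ^ byte (0x80 + cp mod 64)
        else
          byte (0xF0 + cp div 262144) ^ byte (0x80 + cp div 4096 mod 64)
          ^ byte (0x80 + cp div 64 mod 64) ^ byte (0x80 + cp mod 64)
      end

(* utf8Encode 0xE9 = "\195\169", the UTF-8 encoding of U+00E9 ("é") *)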

@JohnReppy
Contributor

The reason for specifying UTF-8 is for source-code portability; are people really going to be writing code in UTF-16 or UTF-32? Furthermore, all existing SML code is already UTF-8.

You may be right that it would be better to just specify the set of allowed code points, instead of requiring normalization. The goal should be to ensure that a given rendering of a string literal in source code has only one meaning in terms of a byte sequence. (I know that there are some code points whose glyphs look like other glyphs, which is a problem related to combining characters.)

@rossberg-old

Well, I understand, but it introduces a conceptual conflation: people are not writing code "in UTF-8" either; they are writing it in a character set. The encoding is just a transport format on a lower level. I am aware that the same mixture of abstraction levels has been introduced in many other legacy systems, but that doesn't make it great. It would be nice if we could avoid it. (Backwards compatibility between UTF-8 and ASCII source files is not affected, AFAICS.)

Re normalisation: the goal is noble, but it looks like a hopeless battle against properties that are inherent to Unicode. Should the language spec really get into the business of "fixing" Unicode? Strings are just data, so it seems fine to leave their sanitisation to the programmer.

@JohnReppy
Contributor

The flexibility of being able to use arbitrary encodings for source code is a nice ideal, but is it worth the extra cost in implementation complexity? In the world of the web, even though alternative encodings are supported, the trend is toward standardization on UTF-8 (85% as of 2015). Requiring compilers to deal with multiple encodings seems like an unnecessary burden.

I think that for the same reason that we do not allow tabs or newlines inside string literals, we should restrict Unicode literals to an unambiguous subset of code points; if the programmer wants something outside that subset he/she can use escape sequences. Leaving sanitization to the programmer is a bad idea; the point is to have the compiler help the programmer avoid mistakes that are hard to detect manually (sort of like type checking :).

@rossberg-old

Ah, I'm not suggesting that compilers need to handle all possible encodings -- that's up to each compiler as much as its handling of, say, file names is. I'd even be fine with saying in the Definition that only UTF-8 needs to be supported. But I still think it's preferable to keep the abstraction levels straight.

I'm not sure I buy the analogy to type checking -- data will typically originate from external sources, so in general you have to sanitise its representations anyway where it's relevant. In many other cases it is irrelevant. I doubt that applying extra rigor to literals buys you much for anything but toy examples.

The analogy probably is closer to something like floating-point literals: by similar arguments we could prescribe to reject float literals that overflow into infinity (or which have fractional digits beyond the representable precision). But we don't bother.

@ratmice
Contributor

ratmice commented May 5, 2017

I'm not sure about normalization; it seems that even if you go through normalization, you will still have characters that appear equal but are not in fact equal. See "Duplicate characters in Unicode".

@JohnReppy
Contributor

If you allow other encodings, then you have to add some sort of mechanism for detecting the encoding, which adds complexity for a feature that is likely to be rarely supported and even more rarely used.

I'm not talking about external sources of data; I'm talking about string literals in the source code, which, presumably, were entered by the programmer.

We reject integer literals that are too large, so we should probably be rejecting floats that are too large too (actually, SML/NJ does so, but in the code generator). The treatment of floating point in SML compilers could be better.

@rossberg-old

If a compiler chooses to support multiple encodings (which it doesn't have to) then the user needs to indicate non-default choices. For example, GCC has a command line switch. That wouldn't concern the language definition.

I'm arguing that in the grand scheme of handling textual data, literals matter least, and you cannot diagnose the more relevant cases. It's nice to have more diagnostics, but in this case I'm doubtful about the benefit to cost ratio.
