Use UTF-8 instead of ISO-8859-1 #601
Conversation
Thank you! A few typos and a few serious questions are sprinkled in the code.
I also did a search for "\# and '\# (and with \x) in the source code and found several occurrences, some of which you addressed, but I wonder if you need to consider the others?
CSyntax.hs: where ppApArg ty = t"\183" <> pPrint d maxPrec ty
CSyntax.hs: where ppApArg ty = t"\183" <> pPrint d maxPrec ty
CVPrint.hs: where ppApArg ty = t"\183" <> pvPrint d maxPrec ty
CVPrint.hs: where ppApArg ty = t"\183" <> pvPrint d maxPrec ty
IExpand.hs: pPrint d p (T t) = text"\183" <> pPrint d p t
ISyntax.hs: sep (pPrint d (maxPrec-1) f : map (nest 2 . (text"\183" <>) . pPrint d maxPrec) ts ++ map (nest 2 . pPrint d maxPrec) es)
Id.hs: "\xbb"-> FInfixr 12 -- https://en.wikipedia.org/wiki/Guillemet or '>>'
Id.hs: "\xb7"-> FInfixr 13 -- https://en.wikipedia.org/wiki/Interpunct or '.'
bsc.hs: escChar '$' accum_str = "\044" ++ accum_str
GenWrapUtils.hs:genSuffix = ['\173']
Lex.hs:isSym c | c >= '\x80' = c `elem` ['\162', '\163', '\164', '\165', '\166',
Lex.hs: '\167', '\168', '\169', '\170', '\171',
Lex.hs: '\172', '\173', '\174', '\175', '\176',
Lex.hs: '\177', '\178', '\179', '\180', '\181',
Lex.hs: '\183', '\184', '\185', '\186', '\187',
Lex.hs: '\188', '\189', '\190', '\191', '\215',
Lex.hs: '\247' ]
Lex.hs:--isSym c | c >= '\x80' = isSymbol c
Lex.hs:isIdChar '\'' = True
Lex.hs:isIdChar '\176' = True
Lex.hs:isIdChar '\180' = True
I notice that not only has the handling of text files changed, but also the handling of binary files has changed. Was this necessary because of the change in encoding, or something you cleaned up while you happened to be working in this area? I do see that some functions in FileIOUtil were supporting reading in a file as either text or binary, and thus some of that was made cleaner by having both formats be read in as ByteString, and only at the end do you unpack into [Word8] for binary or decode UTF8 into String for text.
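As a rough sketch of that pattern (not the actual FileIOUtil code; the use of Data.Text for decoding here is just an assumption for illustration):

    import qualified Data.ByteString.Lazy as B
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.Encoding as TE
    import Data.Word (Word8)

    -- Read the file once as a lazy ByteString, then interpret it at the end:
    -- unpack to [Word8] for binary consumers, or decode UTF-8 into String
    -- for text consumers.
    readAsBinary :: FilePath -> IO [Word8]
    readAsBinary f = B.unpack <$> B.readFile f

    readAsText :: FilePath -> IO String
    readAsText f = T.unpack . TE.decodeUtf8 <$> B.readFile f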
You switched away from hGetContents (in SystemIO), which returns a String, to reading in the file as a ByteString (specifically the lazy version), which is then unpacked into [Word8] and returned (in place of String).
I was not up to date on current best practices, but I found this page on dealing with binary data in Haskell to be helpful. According to that, normal Haskell String types are lists of 32-bit characters, which is more space than is needed and slows things down. So hopefully we should see an improvement in memory usage?
Switching the symbol for composition is good, because the centerdot is also used by the pretty-printer for displaying type arguments (as you noted when having to update the expected output of dumps in the test suite, and as seen above in the source code lines that print \183).
I see that you updated the headers of the .bo and .ba files, which is good. I assume that the old files will still parse as UTF-8 up through the header, to be able to do that check, right?
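A rough sketch of what such a check might look like (the header string below is a placeholder, not the real .bo/.ba header; the point is only that an ASCII header reads the same whether the rest of the file is Latin-1 or UTF-8):

    import qualified Data.ByteString.Lazy as B
    import qualified Data.ByteString.Lazy.Char8 as C

    -- Check the (ASCII) header bytes before attempting to decode the rest
    -- of the file, so old files can still be recognized and handled.
    hasExpectedHeader :: B.ByteString -> Bool
    hasExpectedHeader bs = C.pack "bsc-header" `B.isPrefixOf` bs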
We do need to include updates to the documentation in this PR. Presumably the compose operator is documented (or used) in the BH reference guide. I'd guess that it's not mentioned in the Libraries Guide, since that is in BSV. Do we also need to mention the UTF-8 encoding somewhere (like the BSC User Guide)? Any other doc changes that I'm missing?
I'm excited that this will allow users to start writing programs with identifiers in languages that aren't covered by latin1, as seen in your test cases. As I indicated inline, I'm unclear how uppercase and lowercase are handled in other languages and what that means for defining constructors versus variables.
It might also be worth testing the encoding in other places -- for example, in the package and file name?
src/comp/FileIOUtil.hs
readBinFilePath errh pos verb name path =
    readFilesPath' errh pos True verb [name] path
Bool -> String -> [String] -> IO (Maybe ([Word8], String))
readBinFilePath errh pos verb name path = maybe Nothing unp <$> readFilesPath' errh pos verb [name] path
It may just be me (as I say below as well), but I might have written this as:
readFilesPath' errh pos verb [name] path
  >>= fmap unp
  where unp (bs, name) = (B.unpack bs, name)

or even

  >>= fmap (apFst B.unpack)
But this is fine. It just took me a while to untangle it in my head.
However, thinking on it longer, I wonder if you should write this in a style that matches what you're doing in all the other places where you use decode. In this case, you're not decoding, just unpacking; but I wonder if it would be more readable to just write it the same way, with a different function name. (That also maybe suggests that the decode situations might be written more simply by using fmap? Anyway, I have other comments on the decode cases, below.)
i'll resurrect mapFst in Util.hs (currently commented out and only defined for lists) and use your second alternative.
thanks for the near-instant detailed review :)
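A sketch of how that might look, assuming a pair version of mapFst and the Maybe (ByteString, String) result shape seen in the diff above (names are illustrative, not the actual bsc code):

    import qualified Data.ByteString.Lazy as B
    import Data.Word (Word8)

    -- Pair version of mapFst, as proposed for resurrection in Util.hs.
    mapFst :: (a -> c) -> (a, b) -> (c, b)
    mapFst f (x, y) = (f x, y)

    -- Applying it inside the Maybe result, in the style of the second
    -- alternative suggested in the review comment.
    unpackResult :: Maybe (B.ByteString, String) -> Maybe ([Word8], String)
    unpackResult = fmap (mapFst B.unpack)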
these all print ·
this is inside a comment. presumably from ancient days where fixity was actually fixed and not specified via
i think \044 is
this should be safe — the second unicode block (0x80-0xFF) is the same as the latin1 characters, so you'll get the same code point.
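A quick way to convince yourself of this (the middle dot \183 is one of the characters in question):

    import Data.Char (chr, ord)

    -- Latin-1 code points 0x80-0xFF coincide with the Unicode code points
    -- U+0080-U+00FF, so '\183' still denotes the same character (U+00B7).
    sameCodePoint :: Bool
    sameCodePoint = ord '\183' == 0xB7 && chr 0xB7 == '\183'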
i just added
the binary stuff assumed that one character is always one byte, which is not the case in utf8. so without this change code points that don't fit in one byte would be written (and parsed back as) symbols different from what was in the source. this matters for identifiers for example.
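To make the failure mode concrete, here is a small illustration (using the text package purely as an example, not the actual serializer) of why one byte per character breaks once identifiers leave the Latin-1 range:

    import qualified Data.ByteString as BS
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE
    import Data.Word (Word8)

    -- UTF-8 encoding of a single character; anything above U+007F takes
    -- more than one byte, so a writer that emits exactly one byte per Char
    -- (the old assumption) corrupts such identifiers on the way out.
    utf8Bytes :: Char -> [Word8]
    utf8Bytes c = BS.unpack (TE.encodeUtf8 (T.singleton c))

    -- utf8Bytes 'a' == [0x61], but utf8Bytes '\955' == [0xCE, 0xBB]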
i had to read a BS anyway to decode utf8, so it made sense to just do that and decode based on bin vs not. i made it lazy b/c it was lazy before, could possibly be that strict is better but i didn't check.
i didn't measure memory usage, and the string-based encoding could well have been lazy enough to be deforested. in general this whole "binary" thing is not very efficient as-is; at the very least generation could be improved to use Builders or something, and parsing could use idk attoparsec to be easier to read at least. or maybe one could use protobuf; i vaguely recall that there was even something that did this via generics without even needing but i just wanted to get unicode working...
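For the record, the Builder idea mentioned above would look roughly like this (a sketch only, not a proposal for the actual .bo/.ba format; encodeTag and the trailing zero byte are made up for illustration):

    import qualified Data.ByteString.Builder as BB
    import qualified Data.ByteString.Lazy as BL

    -- Accumulate output as a Builder and materialize it once at the end,
    -- instead of concatenating many small strings.
    encodeTag :: String -> BB.Builder
    encodeTag tag = BB.stringUtf8 tag <> BB.word8 0

    serialize :: [String] -> BL.ByteString
    serialize = BB.toLazyByteString . foldMap encodeTag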
i don't see why not, decoding would fail only at the first byte sequence that is not valid utf8, and you'd never get there b/c laziness and such.
can you point me to where the compose operator appears? i skimmed the docs but couldn't find it.
most scripts not related to ours actually don't make the uppercase/lowercase/titlecase distinctions, and even the latin, greek, cyrillic, etc. alphabets didn't have both uppercase and lowercase forms until relatively recently in history (cf. roman-era inscriptions all over southern and western europe). what that means for constructors vs variables is probably that the distinction is biased to european scripts, or to the fact that programs are commonly written in the latin script (which is just the facts ma'am), depending how generous versus pitchfork-ferous you wish to be.

actually in japanese one might achieve something similar to uppercase by writing something normally written in kanji/hiragana in katakana instead, which kind of has an all-uppercase feeling. but i don't believe most languages have the luxury of multiple phonetically equivalent writing systems in concurrent use like that (maybe serbo-croatian and kazakh get an honourable mention?), and anyway mixing katakana and hiragana in one word like a haskell/bluespec constructor doesn't make sense, so the appropriate distinction here would be all-katakana vs all-kanji/hiragana.

technically the bs spec (which i got from the documentation repo, which i guess is outdated?) only says uppercase and lowercase (as does the haskell spec), but this would preclude e.g. kanji identifiers, so i decided to follow ghc and consider everything that is not uppercase as lowercase. plus surprising things are considered identifier letters in bs, for example the isIdChar cases quoted above. the SV spec specifically specifies the ASCII ranges A-Za-z, but i think it makes sense to match BH here rather than attempt to preserve a false illusion that BSV is actually SV in any meaningful sense.
good point, will do.
Just a quick question: Are we planning to support both the old and the new compose operators? I'm happy with either answer, I just am curious.
Also, does anyone know a good way to type it on a mac keyboard?
i'd like to keep it to one thing, so i guess ∘.
to convert, this is the black magic i did with the bsc sources:
(after doing essentially the same w/
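As a rough sketch of that kind of one-shot conversion (not the actual command used above; the idea is to reinterpret the bytes as Latin-1 and re-encode them as UTF-8):

    import qualified Data.ByteString as BS
    import qualified Data.Text.Encoding as TE

    -- Convert one file in place from Latin-1 to UTF-8. BS.readFile is
    -- strict, so the file is fully read before it is rewritten.
    latin1ToUtf8 :: FilePath -> IO ()
    latin1ToUtf8 f = do
      bytes <- BS.readFile f
      BS.writeFile f (TE.encodeUtf8 (TE.decodeLatin1 bytes))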
hmm good question, i just looked at the us/qwerty layout and none of the circle things seem to be ∘. personally the default apple layouts don't meet my needs, so i have a custom keyboard layout (made using ukelele) anyway, and i already had ∘ on it.

you could also switch your keyboard layout to unicode hex input, hold down option, and type 2 2 1 8 (the code point for ∘). i guess another option is to locally redefine it.

i am somewhat hesitant to choose bs syntax based on whatever is easy to type on a specific platform, since next we'll have complaints from folks on windows, linux, and idk amigaos running on repurposed hardware left over from the nasa mars rover missions.
Fair enough. Just curious, why did we settle on ∘?

Do you happen to know a way to force an import from the command line? Adding "import PreludeExtra" to every file is a little icky.
The rationale is that this is the usual mathematical compose operator (see the source of all knowledge on this), which in theory would aid clarity. The really unfortunate thing here is that we can't make

You could probably hack something up with preprocessor-level defines, but for my taste that's even more icky... Personally, what I would do is add an import to every file, as you said — it's pretty common to import
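For reference, the operator in question is ordinary function composition; in plain Haskell terms (BH syntax and the actual library definition may differ) it amounts to:

    -- The mathematical compose operator, U+2218 (typeable on macOS via
    -- Unicode hex input as option+2218, as discussed above).
    infixr 9 ∘
    (∘) :: (b -> c) -> (a -> b) -> a -> c
    f ∘ g = \x -> f (g x)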
FYI, I was able to boil down the GHC 9.2 failure to a small example and have submitted it as a GHC issue: https://gitlab.haskell.org/ghc/ghc/-/issues/23891
+10 to house quark for the dedication! i believe 9.0.2 worked, it was just 9.2.8 that failed, not sure if you want to correct the ghc issue text.
I pushed two commits. The first cleans up a few things:
The second commit is my proposal for cleaning up the
+1.
+1. iirc i made Char write a string anyway b/c otherwise you don't know how many bytes to read (1–4), but this was def a transparent bug. nice catch.
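For context on the "1–4 bytes" point: the length of a UTF-8 sequence is determined by its first byte, so in principle a reader can know how much to consume without a string wrapper. A sketch (not the serializer actually used in the .bo/.ba format):

    import Data.Bits ((.&.))
    import Data.Word (Word8)

    -- Number of bytes in a UTF-8 sequence, judged from its leading byte
    -- (assumes well-formed input).
    utf8SeqLen :: Word8 -> Int
    utf8SeqLen b
      | b .&. 0x80 == 0x00 = 1   -- 0xxxxxxx
      | b .&. 0xE0 == 0xC0 = 2   -- 110xxxxx
      | b .&. 0xF0 == 0xE0 = 3   -- 1110xxxx
      | otherwise          = 4   -- 11110xxx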
looks good to me as well, thank you!
For the purposes of distinguishing variable and constructor / type / package identifiers, Unicode uppercase and titlecase letters are considered uppercase letters; everything else is considered lowercase.
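In code terms, the rule stated above could be expressed with a predicate like the following (a hypothetical helper for illustration, not necessarily how the lexer spells it):

    import Data.Char (generalCategory, GeneralCategory(UppercaseLetter, TitlecaseLetter))

    -- Uppercase and titlecase letters count as "uppercase" (constructors,
    -- types, packages); every other identifier-starting character counts
    -- as "lowercase" (variables).
    isUpperIdentChar :: Char -> Bool
    isUpperIdentChar c = generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]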
Ok, we are almost there! I have rebased and squashed the commits into just three: use UTF8, change the compose operator, bump the GHC version. All other commits were squashed into the first (use UTF8). I also changed the GHC commit to bump to 9.4.6 (instead of 9.4.5), since that is now available. I've force-pushed this new version of the branch. Once the CI has finished, I am ready to merge this PR, unless you have any last objections. Thank you!
Oh, some last comments: We will need to document a number of things, but we can do that separately. Specifically: (1) that the encoding is Unicode; (2) the BH compose operator (although there's no place for that, since the library document comes in only one version, currently for BSV users); (3) any special characters that BSC allows as part of BH identifiers (see next paragraph).

You noticed that

You removed the explicit
IIRC I initially accidentally allowed these two characters.

Either way, it would be nice if this used an encoding that cannot be aliased in
BTW, I tried removing
The checks passed when run with ghc 9.4.5, but when I updated to 9.4.6, there was one failure on macos-13 (a test in the typechecker suite).
tested on ventura/arm64 with 9.4.6, all typechecker tests passed. looking at the testcase, i suspect it's a fluke as well, i don't see how any of your changes would cause failure in just that one specific case. (full disclosure: everything passed except a bunch of systemc related tests, but i don't have systemc installed).
The GitHub CI passed for the exact same code but with GHC 9.4.5. Then it failed with GHC 9.4.6 and failed again in the same way when I repeated the jobs. It then passed with 9.4.7. So I think this is probably an issue with GHC 9.4.6. I'm happy to merge the PR now, using 9.4.7. (The GitHub CI for macos-13 is x86_64, which might account for the difference with your local testing. Or maybe it's using a slightly different version of macos-13.) The failure with 9.4.6 is due to a reordering of ISyntax defs. It's possible that the blame isn't entirely with GHC 9.4.6 and maybe BSC is relying on the stability of some functions (like tsort or hash or something) that it shouldn't. If so, that would be something that was already in BSC and unlikely to be something we introduced here. (We could try an experiment of running the current
Yeah, I think that's reasonable.
Yeah, it feels like some nondeterminism due to some environment settings, doesn't it? Anyway, it looks to me like the nondeterminism here is benign.