Fix UTF8 in JS getChar#111
Conversation
The way getChar was implemented in JS was not matching the native implementation. The native version return a codepoint, the JS version was supposed to do the same (using a construction called .codePointAt), but it was not working on some utf8 encoded examples (for reasons that I don't understand). This diff mimics the logic that we have natively. It would be probably worth revisiting at some point, because it would be nicer to use JS primitives to do the work, but it's not urgent. Let's fix our JS first. The diff also includes tests on utf8 strings to make sure we don't regress.
| const codePoints = new Array(); | ||
|
|
||
| for (const char of str) { | ||
| const codePoint = char.codePointAt(0); |
There was a problem hiding this comment.
I'm not an expert, but I thought that "code point" in JS is a term that refers to a 16-bit integer, e.g. that could be an element of a UTF-16 sequence. And so explicitly getting UTF8 "code units" requires some other library / manual coding.
There was a problem hiding this comment.
It's surprising that this is failing, as for (... of ...) + codePointAt(0) seems to be documented as working?
There was a problem hiding this comment.
Looking here it seems clear that the returned "code points" are not bytes, but decode UTF-16 into (21-bit?) integers. On the other hand, charCodeAt just gives the 16-bit int elements of a UTF-16 sequence, and leaves the decoding of the use of 1 or 2 of them per code point to the caller. Neither does anything to get to a UTF-8 byte sequence. I suppose that the decoding of UTF-16 surrogate pairs into a single number could be done with codePointAt and then that single int converted into 1-4 bytes manually in a way similar to the code in this PR. Either approach seems fine to me, not sure whether to prefer one over another.
There was a problem hiding this comment.
I'm not an expert, but I thought that "code point" in JS is a term that refers to a 16-bit integer, e.g. that could be an element of a UTF-16 sequence. And so explicitly getting UTF8 "code units" requires some other library / manual coding.
A code unit is one or more bytes in a character encoding (utf-8 and utf-16 are examples of char encodings. JS uses utf-16 internally for strings) that encode a (almost always unicode) code point. One or more code points are assembled together to actually get a glyph that is rendered to screen. utf-8 uses up to 4 bytes to represent a code point (historically 6 but that was deprecated) and utf-16 uses 1 or 2 (99% sure on this, it's been quite a while since I worked closely with utf-16).
There was a problem hiding this comment.
Looking here it seems clear that the returned "code points" are not bytes, but decode UTF-16 into (21-bit?) integers.
Code points are never bytes. There are too many. There are roughly a million defined by unicode, so 21 bits sounds about right.
On the other hand, charCodeAt just gives the 16-bit int elements of a UTF-16 sequence, and leaves the decoding of the use of 1 or 2 of them per code point to the caller. Neither does anything to get to a UTF-8 byte sequence. I suppose that the decoding of UTF-16 surrogate pairs into a single number could be done with codePointAt and then that single int converted into 1-4 bytes manually in a way similar to the code in this PR.
To go from utf-16 to utf-8, you need to scan the byte sequence and convert to a code point and then back to a utf-8 representation of the code point.
| if (charcode < 0x80) { | ||
| utf8.push(charcode); | ||
| } else if (charcode < 0x800) { | ||
| utf8.push((charcode >> 6) | 0xc0); | ||
| utf8.push((charcode & 0x3f) | 0x80); | ||
| } else if (charcode < 0xd800 || charcode >= 0xe000) { | ||
| utf8.push((charcode >> 12) | 0xe0); | ||
| utf8.push(((charcode >> 6) & 0x3f) | 0x80); | ||
| utf8.push((charcode & 0x3f) | 0x80); | ||
| } else { | ||
| // surrogate pair | ||
| i++; | ||
| charcode = 0x10000 + (((charcode & 0x3ff) << 10) | (str.charCodeAt(i) & 0x3ff)); | ||
| utf8.push((charcode >> 18) | 0xf0); | ||
| utf8.push(((charcode >> 12) & 0x3f) | 0x80); | ||
| utf8.push(((charcode >> 6) & 0x3f) | 0x80); | ||
| utf8.push((charcode & 0x3f) | 0x80); | ||
| } |
There was a problem hiding this comment.
I don't understand why we use arithmetic operations in Char.utf8encode and bitwise operations here, but the functionality looks the same to me (assuming JS has sane semantics of bitwise ops).
|
Thanks guys! |
stdin is a byte stream and in my opinion should assume utf8. This seems to be the assumption we have everywhere else. So I changed the type to reflect this and then can use the built in TextEncoder to convert strings. Now this is JS so all bets are off, but I would hope this is likely more correct and more efficient than anything we've hand rolled. But it should definitely produce (slightly) smaller code. All tests pass locally, including those added in pr SkipLabs#111.
I pulled this down and played around. #113 seems to work for me locally. I'll check CI. LMK if you have concerns. |
stdin is a byte stream and in my opinion should assume utf8. This seems to be the assumption we have everywhere else. So I changed the type to reflect this and then can use the built in TextEncoder to convert strings. Now this is JS so all bets are off, but I would hope this is likely more correct and more efficient than anything we've hand rolled. But it should definitely produce (slightly) smaller code. All tests pass locally, including those added in pr SkipLabs#111.
stdin is a byte stream and in my opinion should assume utf8. This seems to be the assumption we have everywhere else. So I changed the type to reflect this and then can use the built in TextEncoder to convert strings. Now this is JS so all bets are off, but I would hope this is likely more correct and more efficient than anything we've hand rolled. But it should definitely produce (slightly) smaller code. All tests pass locally, including those added in pr SkipLabs#111.
The way getChar was implemented in JS was not matching the native implementation. The native version return a codepoint, the JS version was supposed to do the same (using a construction called .codePointAt), but it was not working on some utf8 encoded examples (for reasons that I don't understand).
This diff mimics the logic that we have natively. It would be probably worth revisiting at some point, because it would be nicer to use JS primitives to do the work, but it's not urgent. Let's fix our JS first.
The diff also includes tests on utf8 strings to make sure we don't regress.