Fix UTF8 in JS getChar by pikatchu · Pull Request #111 · SkipLabs/skip

pikatchu · 2024-02-13T09:52:06Z

The JS implementation of getChar did not match the native implementation. The native version returns a code point; the JS version was supposed to do the same (using .codePointAt), but it was not working on some UTF-8 encoded examples (for reasons that I don't understand).

This diff mimics the logic that we have natively. It would probably be worth revisiting at some point, because it would be nicer to use JS primitives to do the work, but it's not urgent. Let's fix our JS first.

The diff also includes tests on UTF-8 strings to make sure we don't regress.

jberdine

LGTM

jberdine · 2024-02-13T10:30:46Z

-  const codePoints = new Array();
-
-  for (const char of str) {
-    const codePoint = char.codePointAt(0);


I'm not an expert, but I thought that "code point" in JS is a term that refers to a 16-bit integer, e.g. that could be an element of a UTF-16 sequence. And so explicitly getting UTF8 "code units" requires some other library / manual coding.

It's surprising that this is failing, as for (... of ...) + codePointAt(0) seems to be documented as working?

Looking here it seems clear that the returned "code points" are not bytes, but decode UTF-16 into (21-bit?) integers. On the other hand, charCodeAt just gives the 16-bit int elements of a UTF-16 sequence, and leaves the decoding of the use of 1 or 2 of them per code point to the caller. Neither does anything to get to a UTF-8 byte sequence. I suppose that the decoding of UTF-16 surrogate pairs into a single number could be done with codePointAt and then that single int converted into 1-4 bytes manually in a way similar to the code in this PR. Either approach seems fine to me, not sure whether to prefer one over another.
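To make the codePointAt / charCodeAt distinction above concrete, here is a minimal sketch. It only uses standard String methods; the sample character (U+1F600) is an arbitrary choice for illustration and is not taken from the PR's tests.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane, so a JS string
// stores it as two UTF-16 code units (a surrogate pair).
const s = "\u{1F600}";

// charCodeAt returns the raw 16-bit code units: high and low surrogate.
const units = [s.charCodeAt(0), s.charCodeAt(1)];

// codePointAt(0) decodes the surrogate pair into a single code point.
const codePoint = s.codePointAt(0);

// for...of iterates by code point, so the pair is visited only once.
const codePoints = [];
for (const ch of s) {
  codePoints.push(ch.codePointAt(0));
}

console.log(units.map((u) => u.toString(16))); // [ 'd83d', 'de00' ]
console.log(codePoint.toString(16)); // 1f600
console.log(codePoints.length); // 1
```

Neither call yields UTF-8 bytes; both stay on the UTF-16 side of the conversion.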

I'm not an expert, but I thought that "code point" in JS is a term that refers to a 16-bit integer, e.g. that could be an element of a UTF-16 sequence. And so explicitly getting UTF8 "code units" requires some other library / manual coding.

A code unit is the fixed-size unit of a character encoding: one byte in utf-8, two bytes in utf-16 (utf-8 and utf-16 are examples of char encodings; JS uses utf-16 internally for strings). One or more code units encode a (almost always unicode) code point. One or more code points are assembled together to actually get a glyph that is rendered to screen. utf-8 uses up to 4 bytes to represent a code point (historically 6, but that was deprecated) and utf-16 uses 1 or 2 16-bit code units (99% sure on this, it's been quite a while since I worked closely with utf-16).
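As a rough illustration of those sizes, the sketch below counts UTF-16 code units (via String length) and UTF-8 bytes (via the built-in TextEncoder) for one code point from each UTF-8 length class. The sample characters are arbitrary picks, not from the PR.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// One sample per UTF-8 length class: U+0061, U+00E9, U+20AC, U+1F600.
const samples = ["a", "\u00e9", "\u20ac", "\u{1F600}"];

const sizes = samples.map((s) => ({
  codePoint: "U+" + s.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  utf16Units: s.length, // String length counts UTF-16 code units
  utf8Bytes: encoder.encode(s).length, // Uint8Array length counts bytes
}));

// Code units / bytes per sample: 1/1, 1/2, 1/3, 2/4.
console.table(sizes);
```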

Looking here it seems clear that the returned "code points" are not bytes, but decode UTF-16 into (21-bit?) integers.

Code points are never bytes. There are too many. Unicode has room for roughly a million of them (the range goes up to U+10FFFF), so 21 bits sounds about right.

On the other hand, charCodeAt just gives the 16-bit int elements of a UTF-16 sequence, and leaves the decoding of the use of 1 or 2 of them per code point to the caller. Neither does anything to get to a UTF-8 byte sequence. I suppose that the decoding of UTF-16 surrogate pairs into a single number could be done with codePointAt and then that single int converted into 1-4 bytes manually in a way similar to the code in this PR.

To go from utf-16 to utf-8, you need to scan the byte sequence and convert to a code point and then back to a utf-8 representation of the code point.
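A minimal sketch of that two-step route (decode UTF-16 to a code point, then re-encode as UTF-8), along the lines of the codePointAt alternative discussed above. The function name encodeUtf8ViaCodePoints is made up for this example, and this is not the code merged in the PR.

```javascript
// Hypothetical helper: walks the string by code point and emits UTF-8 bytes.
function encodeUtf8ViaCodePoints(str) {
  const utf8 = [];
  // for...of yields one string per code point, so surrogate pairs
  // arrive already combined and codePointAt(0) decodes them.
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      utf8.push(cp); // 1 byte: 0xxxxxxx
    } else if (cp < 0x800) {
      utf8.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)); // 2 bytes
    } else if (cp < 0x10000) {
      // 3 bytes. Note: a lone (unpaired) surrogate also lands here and is
      // encoded as-is, whereas TextEncoder would replace it with U+FFFD.
      utf8.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      utf8.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      ); // 4 bytes
    }
  }
  return utf8;
}

console.log(encodeUtf8ViaCodePoints("\u20ac").map((b) => b.toString(16))); // [ 'e2', '82', 'ac' ]
```

For well-formed strings this should agree with TextEncoder; the two differ only on unpaired surrogates.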

jberdine · 2024-02-13T10:36:42Z

+    if (charcode < 0x80) {
+      utf8.push(charcode);
+    } else if (charcode < 0x800) {
+      utf8.push((charcode >> 6) | 0xc0);
+      utf8.push((charcode & 0x3f) | 0x80);
+    } else if (charcode < 0xd800 || charcode >= 0xe000) {
+      utf8.push((charcode >> 12) | 0xe0);
+      utf8.push(((charcode >> 6) & 0x3f) | 0x80);
+      utf8.push((charcode & 0x3f) | 0x80);
+    } else {
+      // surrogate pair
+      i++;
+      charcode = 0x10000 + (((charcode & 0x3ff) << 10) | (str.charCodeAt(i) & 0x3ff));
+      utf8.push((charcode >> 18) | 0xf0);
+      utf8.push(((charcode >> 12) & 0x3f) | 0x80);
+      utf8.push(((charcode >> 6) & 0x3f) | 0x80);
+      utf8.push((charcode & 0x3f) | 0x80);
+    }


I don't understand why we use arithmetic operations in Char.utf8encode and bitwise operations here, but the functionality looks the same to me (assuming JS has sane semantics of bitwise ops).
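On the arithmetic vs bitwise question, a quick sketch of why the two styles agree here. The arithmetic version below is only a guess at the general shape of such code (it is not copied from Char.utf8encode), and both function names are hypothetical. JS bitwise operators work on 32-bit signed integers, which is safe for code points since they stay below 2^21.

```javascript
// Bitwise style, as in the diff above (operating on a decoded code point).
function encodeBitwise(cp) {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [(cp >> 6) | 0xc0, (cp & 0x3f) | 0x80];
  if (cp < 0x10000)
    return [(cp >> 12) | 0xe0, ((cp >> 6) & 0x3f) | 0x80, (cp & 0x3f) | 0x80];
  return [
    (cp >> 18) | 0xf0,
    ((cp >> 12) & 0x3f) | 0x80,
    ((cp >> 6) & 0x3f) | 0x80,
    (cp & 0x3f) | 0x80,
  ];
}

// Arithmetic style (assumed shape): x >> n becomes floor(x / 2^n),
// x & 0x3f becomes x % 64, and | becomes + because the bits never overlap.
function encodeArithmetic(cp) {
  const div = (x, d) => Math.floor(x / d);
  if (cp < 128) return [cp];
  if (cp < 2048) return [192 + div(cp, 64), 128 + (cp % 64)];
  if (cp < 65536)
    return [224 + div(cp, 4096), 128 + (div(cp, 64) % 64), 128 + (cp % 64)];
  return [
    240 + div(cp, 262144),
    128 + (div(cp, 4096) % 64),
    128 + (div(cp, 64) % 64),
    128 + (cp % 64),
  ];
}

// Exhaustive comparison over the whole Unicode range.
let mismatches = 0;
for (let cp = 0; cp <= 0x10ffff; cp++) {
  const a = encodeBitwise(cp);
  const b = encodeArithmetic(cp);
  if (a.length !== b.length || a.some((byte, i) => byte !== b[i])) mismatches++;
}
console.log(mismatches); // 0
```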

pikatchu · 2024-02-13T14:02:50Z

Thanks guys!

gregsexton · 2024-02-13T16:30:09Z

It would probably be worth revisiting at some point, because it would be nicer to use JS primitives to do the work, but it's not urgent. Let's fix our JS first.

I pulled this down and played around. #113 seems to work for me locally. I'll check CI. LMK if you have concerns.

stdin is a byte stream and in my opinion should assume utf8. This seems to be the assumption we have everywhere else. So I changed the type to reflect this and then can use the built in TextEncoder to convert strings. Now this is JS so all bets are off, but I would hope this is likely more correct and more efficient than anything we've hand rolled. But it should definitely produce (slightly) smaller code. All tests pass locally, including those added in pr SkipLabs#111.
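For readers unfamiliar with TextEncoder, a small sketch of the idea. TextEncoder and TextDecoder are standard built-ins; makeByteReader and the -1 end-of-input convention are hypothetical, invented for this example, and do not reflect the actual code in #113.

```javascript
// Encode the whole input once; the result is a Uint8Array of UTF-8 bytes.
const stdinBytes = new TextEncoder().encode("h\u00e9\u{1F600}");

// Hypothetical reader: hands out one byte per call, -1 at end of input.
function makeByteReader(bytes) {
  let pos = 0;
  return () => (pos < bytes.length ? bytes[pos++] : -1);
}

const nextByte = makeByteReader(stdinBytes);
const seen = [];
for (let b = nextByte(); b !== -1; b = nextByte()) {
  seen.push(b.toString(16));
}

console.log(seen); // [ '68', 'c3', 'a9', 'f0', '9f', '98', '80' ]

// TextDecoder (UTF-8 by default) reverses the conversion.
console.log(new TextDecoder().decode(stdinBytes) === "h\u00e9\u{1F600}"); // true
```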

pikatchu requested a review from beauby February 13, 2024 09:52

jberdine approved these changes Feb 13, 2024

pikatchu merged commit 11b6901 into SkipLabs:main Feb 13, 2024

gregsexton mentioned this pull request Feb 13, 2024

Represent stdin as a utf8 byte array and use TextEncoder #113

Merged

pikatchu merged 1 commit into SkipLabs:main from pikatchu:fix_utf8_js
