Skip to content

Fix UTF8 in JS getChar#111

Merged
pikatchu merged 1 commit into
SkipLabs:mainfrom
pikatchu:fix_utf8_js
Feb 13, 2024
Merged

Fix UTF8 in JS getChar#111
pikatchu merged 1 commit into
SkipLabs:mainfrom
pikatchu:fix_utf8_js

Conversation

@pikatchu
Copy link
Copy Markdown
Contributor

The way getChar was implemented in JS was not matching the native implementation. The native version return a codepoint, the JS version was supposed to do the same (using a construction called .codePointAt), but it was not working on some utf8 encoded examples (for reasons that I don't understand).

This diff mimics the logic that we have natively. It would be probably worth revisiting at some point, because it would be nicer to use JS primitives to do the work, but it's not urgent. Let's fix our JS first.

The diff also includes tests on utf8 strings to make sure we don't regress.

The way getChar was implemented in JS was not matching the native
implementation. The native version return a codepoint, the JS version
was supposed to do the same (using a construction called
.codePointAt), but it was not working on some utf8 encoded examples
(for reasons that I don't understand).

This diff mimics the logic that we have natively. It would be probably
worth revisiting at some point, because it would be nicer to use JS
primitives to do the work, but it's not urgent. Let's fix our JS
first.

The diff also includes tests on utf8 strings to make sure we don't
regress.
@pikatchu pikatchu requested a review from beauby February 13, 2024 09:52
Copy link
Copy Markdown
Contributor

@jberdine jberdine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

const codePoints = new Array();

for (const char of str) {
const codePoint = char.codePointAt(0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert, but I thought that "code point" in JS is a term that refers to a 16-bit integer, e.g. that could be an element of a UTF-16 sequence. And so explicitly getting UTF8 "code units" requires some other library / manual coding.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's surprising that this is failing, as for (... of ...) + codePointAt(0) seems to be documented as working?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking here it seems clear that the returned "code points" are not bytes, but decode UTF-16 into (21-bit?) integers. On the other hand, charCodeAt just gives the 16-bit int elements of a UTF-16 sequence, and leaves the decoding of the use of 1 or 2 of them per code point to the caller. Neither does anything to get to a UTF-8 byte sequence. I suppose that the decoding of UTF-16 surrogate pairs into a single number could be done with codePointAt and then that single int converted into 1-4 bytes manually in a way similar to the code in this PR. Either approach seems fine to me, not sure whether to prefer one over another.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert, but I thought that "code point" in JS is a term that refers to a 16-bit integer, e.g. that could be an element of a UTF-16 sequence. And so explicitly getting UTF8 "code units" requires some other library / manual coding.

A code unit is one or more bytes in a character encoding (utf-8 and utf-16 are examples of char encodings. JS uses utf-16 internally for strings) that encode a (almost always unicode) code point. One or more code points are assembled together to actually get a glyph that is rendered to screen. utf-8 uses up to 4 bytes to represent a code point (historically 6 but that was deprecated) and utf-16 uses 1 or 2 (99% sure on this, it's been quite a while since I worked closely with utf-16).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking here it seems clear that the returned "code points" are not bytes, but decode UTF-16 into (21-bit?) integers.

Code points are never bytes. There are too many. There are roughly a million defined by unicode, so 21 bits sounds about right.

On the other hand, charCodeAt just gives the 16-bit int elements of a UTF-16 sequence, and leaves the decoding of the use of 1 or 2 of them per code point to the caller. Neither does anything to get to a UTF-8 byte sequence. I suppose that the decoding of UTF-16 surrogate pairs into a single number could be done with codePointAt and then that single int converted into 1-4 bytes manually in a way similar to the code in this PR.

To go from utf-16 to utf-8, you need to scan the byte sequence and convert to a code point and then back to a utf-8 representation of the code point.

Comment on lines +205 to +222
if (charcode < 0x80) {
utf8.push(charcode);
} else if (charcode < 0x800) {
utf8.push((charcode >> 6) | 0xc0);
utf8.push((charcode & 0x3f) | 0x80);
} else if (charcode < 0xd800 || charcode >= 0xe000) {
utf8.push((charcode >> 12) | 0xe0);
utf8.push(((charcode >> 6) & 0x3f) | 0x80);
utf8.push((charcode & 0x3f) | 0x80);
} else {
// surrogate pair
i++;
charcode = 0x10000 + (((charcode & 0x3ff) << 10) | (str.charCodeAt(i) & 0x3ff));
utf8.push((charcode >> 18) | 0xf0);
utf8.push(((charcode >> 12) & 0x3f) | 0x80);
utf8.push(((charcode >> 6) & 0x3f) | 0x80);
utf8.push((charcode & 0x3f) | 0x80);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we use arithmetic operations in Char.utf8encode and bitwise operations here, but the functionality looks the same to me (assuming JS has sane semantics of bitwise ops).

@pikatchu pikatchu merged commit 11b6901 into SkipLabs:main Feb 13, 2024
@pikatchu
Copy link
Copy Markdown
Contributor Author

Thanks guys!

gregsexton added a commit to gregsexton/skdb that referenced this pull request Feb 13, 2024
stdin is a byte stream and in my opinion should assume utf8. This
seems to be the assumption we have everywhere else. So I changed the
type to reflect this and then can use the built in TextEncoder to
convert strings.

Now this is JS so all bets are off, but I would hope this is likely
more correct and more efficient than anything we've hand rolled. But
it should definitely produce (slightly) smaller code.

All tests pass locally, including those added in pr SkipLabs#111.
@gregsexton
Copy link
Copy Markdown
Contributor

It would be probably worth revisiting at some point, because it would be nicer to use JS primitives to do the work, but it's not urgent. Let's fix our JS first.

I pulled this down and played around. #113 seems to work for me locally. I'll check CI. LMK if you have concerns.

gregsexton added a commit to gregsexton/skdb that referenced this pull request Feb 14, 2024
stdin is a byte stream and in my opinion should assume utf8. This
seems to be the assumption we have everywhere else. So I changed the
type to reflect this and then can use the built in TextEncoder to
convert strings.

Now this is JS so all bets are off, but I would hope this is likely
more correct and more efficient than anything we've hand rolled. But
it should definitely produce (slightly) smaller code.

All tests pass locally, including those added in pr SkipLabs#111.
gregsexton added a commit to gregsexton/skdb that referenced this pull request Feb 15, 2024
stdin is a byte stream and in my opinion should assume utf8. This
seems to be the assumption we have everywhere else. So I changed the
type to reflect this and then can use the built in TextEncoder to
convert strings.

Now this is JS so all bets are off, but I would hope this is likely
more correct and more efficient than anything we've hand rolled. But
it should definitely produce (slightly) smaller code.

All tests pass locally, including those added in pr SkipLabs#111.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants