feat(qwik-core): Uint8Array serializer #5846

Merged
merged 30 commits into from
Feb 27, 2024

Conversation

genki
Contributor

@genki genki commented Feb 12, 2024

Overview

Serializer for the Uint8Array.

What is it?

  • Feature / enhancement
  • Bug
  • Docs / tests / types / typos

Description

Because a Uint8Array can hold large data blobs, this serializer is designed for compactness.
It encodes the Uint8Array as a UTF-16 string, packing two bytes into each code unit.
To stay compact, even-length and odd-length arrays have to be handled differently,
so there are two serializers, Uint8ArrayESerializer and Uint8ArrayOSerializer, to distinguish them.
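
To illustrate the two-bytes-per-code-unit idea, here is a minimal sketch. It is not the code from this PR: `naivePack` and `naiveUnpack` are hypothetical names, the low-byte-first order is an assumption, and only even-length input is handled. It also shows why escaping becomes necessary, since some byte pairs map straight onto surrogate code units.

```javascript
// Hypothetical helper, not the PR's implementation: pack an even-length
// Uint8Array into a string, two bytes per UTF-16 code unit (low byte first).
function naivePack(bytes) {
  if (bytes.length % 2 !== 0) {
    throw new RangeError('this sketch only handles even lengths');
  }
  let out = '';
  for (let i = 0; i < bytes.length; i += 2) {
    out += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
  }
  return out;
}

// Hypothetical inverse of naivePack.
function naiveUnpack(str) {
  const bytes = new Uint8Array(str.length * 2);
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    bytes[2 * i] = c & 0xff; // low byte
    bytes[2 * i + 1] = c >>> 8; // high byte
  }
  return bytes;
}

// The second byte pair (0x00, 0xD8) becomes the code unit 0xD800,
// which is a lone high surrogate and therefore not valid UTF-16 text.
const packed = naivePack(new Uint8Array([0x41, 0x00, 0x00, 0xd8]));
const restored = naiveUnpack(packed);
```

The round trip works inside a single JavaScript engine, but the lone surrogate in `packed` is exactly the kind of code unit that text-processing layers may alter.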

Use cases and why

Compared to other typed arrays, Uint8Array is widely used for data buffers, credential data, and so on.
Fixes #4416

Checklist:

  • My code follows the developer guidelines of this project
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • Added new tests to cover the fix / functionality


netlify bot commented Feb 12, 2024

👷 Deploy request for qwik-insights pending review.

Visit the deploys page to approve it

Name Link
🔨 Latest commit 4f0023c

@genki genki changed the title Added Uint8Array serializer feat(qwik-core): Added Uint8Array serializer Feb 12, 2024
@genki genki changed the title feat(qwik-core): Added Uint8Array serializer feat(qwik-core): Uint8Array serializer Feb 12, 2024
Member

@wmertens wmertens left a comment

I like it, but:

  • it's missing tests
  • Most likely you'll need an escape mechanism?
  • the code can be deduplicated by moving the functions out and making them into factories depending on odd/even

👍

@genki
Contributor Author

genki commented Feb 14, 2024

@wmertens I see, this has some problems.

@genki genki closed this Feb 14, 2024
@wmertens
Member

@genki don't get me wrong, this could probably be useful so a solidly tested PR would still be welcome.

@genki
Contributor Author

genki commented Feb 14, 2024

@wmertens Thank you. It was just carelessness on my part.
I found that this code needs a large amount of testing in various environments to check whether unmatched surrogates in the encoded strings get replaced with 0xFFFD.
JavaScript permits such invalid UTF-16 strings but gives no guarantees about how they are treated.
The behaviour depends on the browser and on any intermediate text-processing libraries.
Unfortunately, I currently have no time to do such testing.
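
One concrete case of this replacement can be reproduced with the standard `TextEncoder`, which converts any lone surrogate to U+FFFD before producing UTF-8. The snippet below is only an illustration of that one layer; other browsers, libraries, or transports may treat lone surrogates differently.

```javascript
// A string holding a single unmatched high surrogate.
const lone = String.fromCharCode(0xd800);

// TextEncoder always emits well-formed UTF-8, so the lone surrogate
// is replaced by U+FFFD (bytes EF BF BD).
const utf8 = new TextEncoder().encode(lone);

// Decoding gives back U+FFFD, so the original code unit is lost.
const roundTripped = new TextDecoder().decode(utf8);
const survived = roundTripped.charCodeAt(0) === 0xd800; // false
```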

@genki
Contributor Author

genki commented Feb 14, 2024

I have fixed the previous implementation so that it serializes a Uint8Array into a valid UTF-16 string without unmatched surrogates.
The two serializers are now merged into one.
I have tested it locally more than 1,000,000 times on random buffers, and the estimated redundancy is less than 7% in bytes. (The surrogates are 0xD800-0xDFFF = 2048 code units, and 2048/65536 = 3.125%. As they are escaped and doubled in size, the increase is around 6.25%.)
However, I am not familiar with where serialization tests should be placed in the core package.
Where should I look?
I consider so-called sanitization to be outside the responsibility of the serializer module.

@genki genki reopened this Feb 14, 2024
@wmertens
Member

Wow that's a lot of handling needed :) are you sure that it's impossible for an even sequence to look like an odd one?

You can add any test files you like.

It's a bit heavy to review for me right now, I'll get back to this later. Notice the lint error btw.

@genki
Contributor Author

genki commented Feb 14, 2024

@wmertens
Yep, I am in no hurry :)
The odd sequence has a marker at the end like 0xFFFD, 0x00XX, where XX is the last byte of the Uint8Array.
I know it is possible to make a more space-efficient algorithm with bitwise encoding,
but that introduces more computational cost in JavaScript. I think the bytewise approach is a neat balance.

@wmertens
Member

Right but I meant, what if the even array has the same bytes at the end? Or is that sequence reserved by the escaping?

@genki
Contributor Author

genki commented Feb 15, 2024

@wmertens
That is why the escape character 0xFFFD is put in front of the last byte. Every escape character is itself doubled and encoded as 0xFFFD 0xFFFD, so a single 0xFFFD followed by 0x00XX always means the last byte of an odd array.
There are 4 use cases for the escape character.

  1. 0xFFFD 0xD800 makes a fake surrogate pair for an unmatched low surrogate.
  2. 0xFFFD [0xD801-0xDBFF] 0xDC00 makes a fake surrogate pair for an unmatched high surrogate [0xD800-0xDBFF], except 0xD800.
  3. 0xFFFD 0xD801 0xDC01 stands for 0xD800 itself, since 0xD800 is used as the fake high surrogate.
  4. 0xFFFD 0x00XX means the array is odd-length and its last byte is XX.

Note that the fake low surrogate 0xDC00 is covered by the first case as 0xFFFD 0xD800 0xDC00, so it needs no extra treatment.
With the above, all bytes are encoded into valid UTF-16.

@genki
Contributor Author

genki commented Feb 15, 2024

When encoding the Uint8Array into a UTF-16 string, the data falls into 4 groups.

  1. Normal code points: 96.875% of the code units. They can be used as is.
  2. Unmatched high surrogates 0xD800-0xDBFF. They need to be paired with some low surrogate.
  3. Unmatched low surrogates 0xDC00-0xDFFF. They need to be paired with some high surrogate.
  4. The last byte of an odd-length array.
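
The grouping of 16-bit code units can be sketched as below, using the standard UTF-16 ranges (high surrogates 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF). `classifyCodeUnit` and `hasLoneSurrogate` are hypothetical helper names for illustration and are not part of the PR.

```javascript
// Classify one 16-bit code unit by its UTF-16 role.
function classifyCodeUnit(c) {
  if (c >= 0xd800 && c <= 0xdbff) return 'high';
  if (c >= 0xdc00 && c <= 0xdfff) return 'low';
  return 'normal';
}

// Report whether a string contains any surrogate that is not part of
// a well-formed high+low pair.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const kind = classifyCodeUnit(str.charCodeAt(i));
    if (kind === 'high') {
      // A high surrogate must be followed directly by a low surrogate.
      if (i + 1 >= str.length || classifyCodeUnit(str.charCodeAt(i + 1)) !== 'low') {
        return true;
      }
      i++; // skip the matched low surrogate
    } else if (kind === 'low') {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

// Count the surrogate code units among all 65536 possible values.
let surrogateCount = 0;
for (let c = 0; c <= 0xffff; c++) {
  if (classifyCodeUnit(c) !== 'normal') surrogateCount++;
}
const normalShare = 1 - surrogateCount / 65536; // the 96.875% figure
```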

@genki
Contributor Author

genki commented Feb 15, 2024

Added tests and moved utility functions into a separate file.

@genki genki requested a review from wmertens February 18, 2024 08:33
@genki
Contributor Author

genki commented Feb 22, 2024

Fixed the unit test, because the TextEncoder drops the BOM character even though it is a valid UTF-16 character.

@genki
Contributor Author

genki commented Feb 22, 2024

I realized that the BOM should also be escaped, to protect it from unexpected replacement or removal by text processors.
So I have changed the implementation slightly to do that.
Added a special sequence 0xFFFD 0xD801 0xDC02 that stands for the escaped BOM.

As far as I know, no characters in UTF-16 other than unmatched surrogates and the BOM have undefined treatment.

@wmertens
Member

Sorry fixed the true. Yes about for of, see my other comments

@genki
Contributor Author

genki commented Feb 26, 2024

@wmertens I see.
The pseudocode seems to lack handling for an unmatched low surrogate that arrives while surrogate is undefined.
Anyway, I will change the implementation after some work.

@wmertens
Member

@genki actually no, the low surrogate should be handled in maybe_escape.

@genki
Contributor Author

genki commented Feb 26, 2024

Wouldn't it be better to handle the unmatched low surrogate in an else if clause following the if (surrogate) {?

@genki
Contributor Author

genki commented Feb 26, 2024

@wmertens
I have remembered the odd behaviour of for...of on strings.
JavaScript yields a surrogate pair as a single code point when iterating with for...of,
so the iteration count may be less than the length of the string if it includes surrogate pairs.
To avoid this, we have to split the string in advance, like this.

  for (const s of code.split('')) {
    const c = s.charCodeAt(0);
    if (!escaped) {
      if (c === esc) {

Is it acceptable?
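
The difference between iterating by code point and by code unit can be checked with a small standalone example. The variable names are illustrative only and do not come from the PR.

```javascript
// 'a', U+1F600 (stored as the surrogate pair D83D DE00), 'b'
const s = 'a' + String.fromCharCode(0xd83d, 0xde00) + 'b';

// for...of walks code points, so the surrogate pair counts once.
let iterations = 0;
for (const ch of s) {
  iterations++;
}

// An index loop with charCodeAt walks every 16-bit code unit.
const units = [];
for (let i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i));
}
// s.length is 4 and units has 4 entries, but iterations is 3.
```

An index-based loop avoids both the intermediate array created by `split('')` and the need to look at `charCodeAt(1)` inside the loop body.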

@genki
Contributor Author

genki commented Feb 26, 2024

Or we need to do something like this.

  for (const s of code) {
    const c = s.charCodeAt(0);
    if (!escaped) {
      if (c === esc) {
        escaped = true;
      } else {
        // normal codepoint
        bytes[j++] = c & 0xff;
        bytes[j++] = c >>> 8;
        if (c >= 0xD800 && c <= 0xDBFF) {
          const d = s.charCodeAt(1);
          bytes[j++] = d & 0xff;
          bytes[j++] = d >>> 8;
        }
      }
      continue;
    }

Either way, it incurs extra computing cost. Does this still have a gain over the original implementation?

@genki
Contributor Author

genki commented Feb 26, 2024

Updated the implementation to use for...of and fixed a few things.
I think the cost of checking for a high surrogate is relatively trivial, so I have adopted it.

@genki
Contributor Author

genki commented Feb 26, 2024

Since the treatment of ESC ESC may seem strange, let me explain in advance.

  • For an even-length array that ends with 0xD83F, the last 2 bytes are encoded as ESC ESC, as an unmatched high surrogate.
  • For an odd-length array that ends with 0x7F, the last byte is encoded as ESC ESC, as the last byte followed by ESC.
  • There is no way to distinguish these as is, so the ending 0xD83F or 0x7f needs to be encoded differently.

So I chose to change the encoding of a 0x7f at the end of odd arrays from ESC ESC to 0x100 ESC.

@wmertens
Member

Well no, not quite. You can't know if an array is odd or even by looking at the end. You have to walk the array, and if you end with escaped true, then you know that it is odd.

For that you don't need any special esc handling.

@genki
Contributor Author

genki commented Feb 27, 2024

@wmertens
Even if we walk the array, we can't distinguish a final 0x007f 0x007f, because both units arrive while escaped = false.
The simplest example:

  1. 0x007f 0x007f: the encoding of an even array of 2 bytes, 0xD8 0x3F.
  2. 0x007f 0x007f: the encoding of an odd array of 1 byte, 0x7f.

How do we distinguish them?

Member

@wmertens wmertens left a comment

Almost there 😎

const packed = packUint8Array(a);
// 0xD800-0xDFFF = 2048. 2048/65536 = 0.03125
// These are doubled in length, so inflating ratio is about 1.0625
expect((packed.length * 2) / a.length).toBeLessThan(1.07);
Member

Shouldn't we also check the after-JSON length? And compare it with a base-64 encoding?

Contributor Author

It depends on the encoding of the HTML document.
If it uses UTF-16, our encoder/decoder reaches its best inflation ratio (IR) of about 106%.
But if it uses UTF-8, the IR worsens to about 147%.

In UTF-8:
1 byte: 0x00-0x7F: IR = 50%, share is 128/65536
2 bytes: 0x80-0x7FF: IR = 100%, share is 1920/65536
3 bytes: 0x800-0xFFFF: IR = 150%, share is (63488 - 2048)/65536 = 61440/65536
4 bytes: 0x10000-0x10FFFF (all surrogate pairs in UTF-16): IR = 100%, share is 2048/65536
So in total, the IR is about 147%.

If Qwik drops support for encodings other than UTF-8, base64 may be the better choice, because base64 output is plain ASCII, which UTF-8 stores in one byte per character. The IR of base64 encoding is 4/3, about 133%.
The best approach may be to choose the encoder/decoder depending on the encoding of the HTML document: for example, base64 if it is UTF-8 or latin1, otherwise this serializer.
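
The 147% estimate can be reproduced with a short calculation over all 65536 code-unit values. This is a rough sketch under stated assumptions: uniformly random input, 128 values in the one-byte range 0x00-0x7F, each surrogate code unit counted as half of a 4-byte UTF-8 sequence, and no allowance for the escape sequences or for JSON/HTML escaping. `utf8BytesPerUnit` is a hypothetical helper name.

```javascript
// UTF-8 bytes needed for one UTF-16 code unit (each unit carries 2 data bytes).
function utf8BytesPerUnit(c) {
  if (c <= 0x7f) return 1;
  if (c <= 0x7ff) return 2;
  if (c >= 0xd800 && c <= 0xdfff) return 2; // half of a 4-byte pair
  return 3;
}

// Sum the UTF-8 size over every possible code unit value.
let totalUtf8Bytes = 0;
for (let c = 0; c <= 0xffff; c++) {
  totalUtf8Bytes += utf8BytesPerUnit(c);
}

// Inflation ratio: output bytes divided by input bytes.
const utf8Ratio = totalUtf8Bytes / (65536 * 2); // about 1.468
const base64Ratio = 4 / 3; // about 1.333
const base64IsSmallerInUtf8 = base64Ratio < utf8Ratio;
```

Under these assumptions base64 comes out smaller than the packed UTF-16 string once the page is served as UTF-8.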

Contributor Author

This result comes from UTF-8 not being space-efficient for anything other than ASCII characters.
Although UTF-8 is common for HTML, UTF-16 is the better choice if the document contains many characters above code point 0x7f, because most characters that take 3 bytes in UTF-8 take 2 bytes in UTF-16.

Contributor Author

For Qwik in particular, since it uses a lot of serialization that produces special characters, it may be good to make UTF-16 the default encoding for better performance.

Member

I think Qwik only supports utf-8 at the moment, so then base64 would be a better choice :-/.

However, we have 96 safe-ish characters at our disposal that encode to one byte, so we could implement our own base-96. However, here's a very cool package that does UTF-32 encoding and it talks about other encodings, saying that for UTF-8, base-64 is still preferred :-/.

}
c |= b << 8;
low = true;
if (surrogate !== undefined) {
Member

you can just do if (surrogate), it will always be truthy if pending

Contributor Author

I see.

// double the escape character
if (c === ESC) {
  code += String.fromCharCode(0x08ff);
}
Member

@genki I still think this should send ESC + 0x08ff ?

Also, I think the code will be nicer (and shorter) if you make a maybeEscape(code) function that adds to the output and handles all the escape cases.

Comment on lines 143 to 148
if (escaped) {
  // Array is odd-length, remove last byte
  return bytes.subarray(0, j - 1);
} else {
  return bytes.subarray(0, j);
}
Member

return bytes.subarray(0, escaped ? j - 1: j)

Contributor Author

Yes, this was wrong. I will fix.

As is my habit, I prefer not to call functions in a tight inner loop, but that may be old-fashioned.
Do you think the optimizer handles it well?

// the mismatched high or low surrogate with an escaped value.
//
// 0x007f: escape, because it's rare but still only one utf-8 byte.
// To escape itself, use 0x007f 0x08ff (two bytes utf-8)
Member

actually this is 4 bytes utf-8

}
if (!low && bytes.length > 0) {
  // put the last byte
  code += String.fromCharCode(c === ESC ? 0x100 : c, ESC);
Member

I really think you don't need this now

@wmertens
Member

BTW, you can see how much this PR adds to the qwik bundle by looking at this line inside the Qwik CI check:
https://github.com/BuilderIO/qwik/actions/runs/8056685930/job/22006261963#step:7:38
(you can also see this when you run pnpm build.core)

For main it's core.min.mjs: { original: '67.6kb', brotli: '23.2kb' }, so currently we're adding 1kb of code, 400 bytes minified. If we reduce repetition by using internal helper functions that can probably go down a little still.

This code will always be shipped, so every byte helps, as it is multiplied by millions of eventual views :)

@wmertens
Member

Even if we walk the array, we can't distinguish a final 0x007f 0x007f, because both units arrive while escaped = false.
The simplest example:

  1. 0x007f 0x007f: the encoding of an even array of 2 bytes, 0xD8 0x3F.
  2. 0x007f 0x007f: the encoding of an odd array of 1 byte, 0x7f.

How do we distinguish them?

Your second example is wrong, encoding 0x7f should be 0x007f 0x08fe 0x007f.

@wmertens
Member

Very interesting: https://blog.kevinalbs.com/base122

@wmertens
Member

Ok so the base-122 encoding assumes embedding into HTML directly, and it is smaller, but after gzip encoding, it is larger :-(

So it looks like we should just use base64 encode/decode. Too bad because this was fun code, but the boring way seems to be the most efficient.

@genki
Contributor Author

genki commented Feb 27, 2024

Okay, so I will change the implementation to use base64.
Don't worry, the code we wrote is still usable for serialization to localStorage.

@wmertens
Member

Yeah and w3 discourages using UTF-16 for webpages.

@genki
Contributor Author

genki commented Feb 27, 2024

Completed :)
Now I know that base64 is not so bad in UTF-8.

Comment on lines 451 to 456
const buf = atob(data);
const array = new Uint8Array(buf.length);
for (let i = 0; i < buf.length; i++) {
  array[i] = buf.charCodeAt(i);
}
return array;
Member

Why don't you use TextEncoder here?

Contributor Author

Sorry, I remembered that TextEncoder and TextDecoder are not usable for this purpose.
TextDecoder can only decode byte arrays that are encoded text.
We need to use a generic base64 encoder for arbitrary binary data.
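
A small check shows why the text codecs are lossy for arbitrary bytes: with the default UTF-8 decoder, every byte that is not valid UTF-8 becomes U+FFFD, so the original values cannot be recovered. The byte values here are just an example.

```javascript
// Three bytes that are not valid UTF-8 on their own.
const original = new Uint8Array([0xff, 0xfe, 0x80]);

// The default (non-fatal) UTF-8 decoder maps each invalid byte to U+FFFD.
const text = new TextDecoder().decode(original);

// Re-encoding yields three copies of EF BF BD instead of the input bytes.
const back = new TextEncoder().encode(text);
const lossless = back.length === original.length; // false
```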

@genki
Contributor Author

genki commented Feb 27, 2024

Fixed. Now uses for...of for performance.

for (const c of v) {
  buf += String.fromCharCode(c);
}
return btoa(buf);
Member

You can make this a tiny bit smaller by doing btoa(buf).replace(/=+$/, '')

Contributor Author

Indeed. Fixed.
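
Putting the two reviewed snippets together, the base64 round trip with padding stripped might look like the sketch below. The function names `encodeBytes` and `decodeBytes` are hypothetical and the merged Qwik code may be organized differently. It relies on `atob` accepting input without trailing `=`, which the HTML Standard's forgiving-base64 decoding allows.

```javascript
// Hypothetical wrapper around the encoding snippet from this review.
function encodeBytes(v) {
  let buf = '';
  for (const c of v) {
    buf += String.fromCharCode(c); // one Latin-1 character per byte
  }
  // Strip the trailing '=' padding to save a few characters.
  return btoa(buf).replace(/=+$/, '');
}

// Hypothetical wrapper around the decoding snippet from this review.
function decodeBytes(data) {
  const buf = atob(data); // tolerates missing padding
  const array = new Uint8Array(buf.length);
  for (let i = 0; i < buf.length; i++) {
    array[i] = buf.charCodeAt(i);
  }
  return array;
}

// Round trip for an even-length and an odd-length array.
const encodedEven = encodeBytes(new Uint8Array([1, 2, 3, 4]));
const decodedEven = decodeBytes(encodedEven);
const decodedOdd = decodeBytes(encodeBytes(new Uint8Array([0xff, 0x00, 0x7f])));
```

Building the intermediate string one character at a time is simple but can be slow for very large arrays, so a chunked approach may be worth measuring if that matters.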

Member

@wmertens wmertens left a comment

LGTM

@wmertens wmertens merged commit 27d978b into QwikDev:main Feb 27, 2024
22 checks passed
@genki
Contributor Author

genki commented Feb 27, 2024

@wmertens
BTW, I have put the final result here. I had misunderstood the treatment of the last byte of odd-length arrays. It is fixed now.
https://gist.github.com/genki/e86e4907d0f5ed04340ab0ec55250499

Development

Successfully merging this pull request may close these issues.

[✨] Serialize Uint8Array & other TypedArray
2 participants