feat(qwik-core): Uint8Array serializer #5846

genki · 2024-02-12T21:48:37Z

Overview

Serializer for the Uint8Array.

What is it?

Feature / enhancement
Bug
Docs / tests / types / typos

Description

Because a Uint8Array can hold large data blobs, the serializer is designed for compactness.
It encodes the Uint8Array into UTF-16 code units, two bytes at a time.
To stay compact, even-length and odd-length arrays have to be handled differently.
So there are two serializers, Uint8ArrayESerializer and Uint8ArrayOSerializer, to distinguish them.
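
For illustration only, a minimal sketch of the two-bytes-per-code-unit idea. The names `naivePack` and `naiveUnpack` are hypothetical and are not the PR's functions. The sketch assumes an even-length input, puts the low byte first, and does no escaping, so its output can contain unmatched surrogates.

```typescript
// Hypothetical helpers for illustration; NOT the PR's implementation.
// Packs two bytes into one UTF-16 code unit (low byte first).
// Assumes an even-length input and performs no surrogate escaping.
function naivePack(bytes: Uint8Array): string {
  if (bytes.length % 2 !== 0) {
    throw new Error('naivePack: this sketch only handles even-length input');
  }
  let out = '';
  for (let i = 0; i < bytes.length; i += 2) {
    out += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
  }
  return out;
}

// Reverses naivePack: each code unit yields its low byte, then its high byte.
function naiveUnpack(code: string): Uint8Array {
  const bytes = new Uint8Array(code.length * 2);
  for (let i = 0; i < code.length; i++) {
    const c = code.charCodeAt(i);
    bytes[2 * i] = c & 0xff;
    bytes[2 * i + 1] = c >>> 8;
  }
  return bytes;
}
```

An input such as `[0x00, 0xD8]` becomes the lone surrogate 0xD800 here, which is the case an escape mechanism has to cover.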

Use cases and why

Compared to other typed arrays, the Uint8Array is widely used for data buffers, credential data, and so on.
Fixes #4416

Checklist:

My code follows the developer guidelines of this project
I have performed a self-review of my own code
I have made corresponding changes to the documentation
Added new tests to cover the fix / functionality

netlify · 2024-02-12T21:48:41Z

👷 Deploy request for qwik-insights pending review.

Visit the deploys page to approve it

Name	Link
🔨 Latest commit	`4f0023c`

wmertens

I like it, but:

it's missing tests
- especially tests for edge cases involving non-printable characters, XSS result strings like <script, unicode modifiers etc.
- You could look at https://github.com/minimaxir/big-list-of-naughty-strings for inspiration of what shouldn't be allowed as a result
Most likely you'll need an escape mechanism?
the code can be deduplicated by moving the functions out and making them into factories depending on odd/even


genki · 2024-02-14T08:16:06Z

@wmertens I see, this has some problems.

wmertens · 2024-02-14T13:49:28Z

@genki don't get me wrong, this could probably be useful so a solidly tested PR would still be welcome.

genki · 2024-02-14T15:30:20Z

@wmertens Thank you. It was just carelessness on my part.
I found that this code needs a large amount of testing in various environments to check whether the unmatched surrogates in the encoded strings get replaced with 0xFFFD.
JavaScript permits such invalid UTF-16 strings, but gives no guarantees about how they are treated.
The behaviour depends on the browser or on intermediate text processing libraries.
Unfortunately, I currently have no time to do such testing.
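
One concrete case of this replacement can be reproduced with the standard `TextEncoder` and `TextDecoder`, which convert a lone surrogate to U+FFFD when encoding to UTF-8. This is a small sketch of that single path only; it says nothing about how other text processors behave.

```typescript
// A lone high surrogate is a legal JavaScript string value.
const lone = '\uD800';

// TextEncoder replaces the unmatched surrogate with U+FFFD before
// producing UTF-8, so the bytes are EF BF BD.
const utf8 = new TextEncoder().encode(lone);

// Decoding gives back U+FFFD, not the original code unit.
const roundTripped = new TextDecoder().decode(utf8);
```

After this round trip the original code unit is gone, so any bytes packed into it cannot be recovered.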

genki · 2024-02-14T22:33:45Z

I have fixed the previous implementation so that it serializes a Uint8Array into a valid UTF-16 string without unmatched surrogates.
The two serializers are now merged into one.
I have tested it locally more than 1,000,000 times over random buffers, and the estimated redundancy is less than 7% in bytes. (The surrogates are in 0xD800-0xDFFF = 2048 code units. 2048/65536 = 3.125%. As they are escaped and doubled in size, it is around a 6.25% increase.)
But I am not familiar with where serialization tests should be placed in the core package.
Where should I look?
I think that so-called sanitization is outside the responsibility of the serializer module.

wmertens · 2024-02-14T23:05:51Z

Wow that's a lot of handling needed :) are you sure that it's impossible for an even sequence to look like an odd one?

You can add any test files you like.

It's a bit heavy to review for me right now, I'll get back to this later. Notice the lint error btw.

genki · 2024-02-14T23:41:16Z

@wmertens
Yep, I am not hurrying :)
The odd sequence has a marker at the end like 0xFFFD, 0x00XX, where XX is the last byte of the Uint8Array.
I know it is possible to make a more space-efficient algorithm if we use bitwise encoding.
But it introduces more computational cost in JavaScript. I think the bytewise approach is a neat balance.

…ialize_uint8array

wmertens · 2024-02-15T07:29:04Z

Right but I meant, what if the even array has the same bytes at the end? Or is that sequence reserved by the escaping?

genki · 2024-02-15T09:20:37Z

@wmertens
That is why the escape character 0xFFFD is put in front of the last byte. Every escape character itself is encoded doubled, as 0xFFFD 0xFFFD, so a single 0xFFFD followed by 0x00XX always means the last byte of an odd array.
There are 4 kinds of use cases for the escape character.

0xFFFD 0xD800 makes a fake surrogate pair for an unmatched low surrogate.
0xFFFD [0xD801-0xDBFF] 0xDC00 makes a fake surrogate pair for an unmatched high surrogate [0xD800-0xDBFF] other than 0xD800.
0xFFFD 0xD801 0xDC01 stands for 0xD800 itself, which is otherwise used as the fake high surrogate.
0xFFFD 0x00XX means the array is odd and holds its last byte as XX.

Note the fake low surrogate 0xDC00 is covered by the first case as 0xFFFD 0xD800 0xDC00, so no extra treatment is needed for it.
As above, all the bytes are encoded into valid UTF-16.

genki · 2024-02-15T09:36:21Z

When encoding the Uint8Array into a UTF-16 string, there are 4 kinds of parts.

Normal code points. 96.875% of the possible 16-bit values. They can be used as is.
Unmatched high surrogate 0xD800-0xDBFF. They need to be paired with some low surrogate.
Unmatched low surrogate 0xDC00-0xDFFF. They need to be paired with some high surrogate.
The last byte of an odd-length array.
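
As a rough sketch of the first three categories: in UTF-16, high surrogates are 0xD800-0xDBFF and low surrogates are 0xDC00-0xDFFF. The `classifyUnit` helper and the `UnitKind` type below are hypothetical names used only for illustration and are not taken from the PR.

```typescript
// Hypothetical classifier for a single 16-bit value; not the PR's code.
type UnitKind = 'normal' | 'high-surrogate' | 'low-surrogate';

function classifyUnit(c: number): UnitKind {
  if (c >= 0xd800 && c <= 0xdbff) return 'high-surrogate';
  if (c >= 0xdc00 && c <= 0xdfff) return 'low-surrogate';
  return 'normal';
}

// Share of 16-bit values that need no special treatment:
// (65536 - 2048) / 65536 = 0.96875
const normalShare = (0x10000 - 2048) / 0x10000;
```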

…_uint8array

…ialize_uint8array

genki · 2024-02-15T11:28:12Z

Added tests and moved utility functions into a separate file.

genki · 2024-02-22T02:06:53Z

Fixed the unit test because the TextEncoder drops the BOM character even if that is a valid UTF-16 character.

genki · 2024-02-22T04:24:38Z

I realized that the BOM should also be escaped to protect it from unexpected replacement or removal by text processors.
So I have changed the implementation slightly to do so.
I added a special sequence 0xFFFD 0xD801 0xDC02 that means the escaped BOM.

As far as I know, there are no characters in UTF-16 with undefined treatment other than unmatched surrogates and the BOM.

wmertens · 2024-02-26T17:34:05Z

Sorry fixed the true. Yes about for of, see my other comments

genki · 2024-02-26T18:22:06Z

@wmertens I see.
The pseudo code seemed to lack handling for an unmatched low surrogate that arrives while surrogate is undefined.
Anyway, I will change the implementation after some work.

wmertens · 2024-02-26T19:16:25Z

@genki actually no, the low surrogate should be handled in maybe_escape.

genki · 2024-02-26T19:47:39Z

Wouldn't it be better to handle the unmatched low surrogate in an else if clause following the if (surrogate) {?

genki · 2024-02-26T20:42:11Z

@wmertens
I have recalled the strange behaviour of for...of on strings.
JavaScript turns a surrogate pair into a single code point while iterating with for...of.
So the iteration count may be less than the length of the string if it includes surrogate pairs.
To avoid this, we have to split the string in advance like this.

  for (const s of code.split('')) {
    const c = s.charCodeAt(0);
    if (!escaped) {
      if (c === esc) {

Is it acceptable?
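
The iteration difference described above can be checked with a short sketch. `Array.from` on a string uses the same iterator as for...of, so it yields code points, while `split('')` yields individual code units. The variable names are illustrative only.

```typescript
// 'a', then U+1F600 stored as the surrogate pair D83D DE00, then 'b'.
const sample = 'a\uD83D\uDE00b';

// Same iteration protocol as for...of: the pair comes out as one element.
const byCodePoint = Array.from(sample);

// split('') cuts between code units, so the pair becomes two elements.
const byCodeUnit = sample.split('');

// The middle element of byCodePoint is a 2-unit string, so its low
// surrogate is reachable with charCodeAt(1).
const lowHalf = byCodePoint[1].charCodeAt(1);
```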

genki · 2024-02-26T20:54:56Z

Or, it needs to be done like this.

  for (const s of code) {
    const c = s.charCodeAt(0);
    if (!escaped) {
      if (c === esc) {
        escaped = true;
      } else {
        // normal codepoint
        bytes[j++] = c & 0xff;
        bytes[j++] = c >>> 8;
        if (c >= 0xD800 && c <= 0xDBFF) {
          const d = s.charCodeAt(1);
          bytes[j++] = d & 0xff;
          bytes[j++] = d >>> 8;
        }
      }
      continue;
    }

Either way, it adds extra computing cost. Does this still have a gain over the original implementation?

genki · 2024-02-26T21:46:44Z

Updated the implementation to use for...of and fixed things.
I think the cost of checking for a high surrogate is relatively trivial, so I have adopted it.

genki · 2024-02-26T22:21:27Z

As the treatment of ESC ESC may feel strange, let me explain in advance.

For an even-length array that ends with 0xD83F, its last 2 bytes are encoded to ESC ESC as an unmatched high surrogate.
For an odd-length array that ends with 0x7F, its last byte is encoded to ESC ESC as the last byte and ESC.
There's no way to distinguish them as is, so the ending 0xD83F or 0x7f needs to be encoded in a different way.

So I chose to change the encoding of 0x7f at the end of odd arrays from ESC ESC to 0x100 ESC.

wmertens · 2024-02-27T05:31:05Z

Well no, not quite. You can't know if an array is odd or even by looking at the end. You have to walk the array, and if you end with escaped true, then you know that it is odd.

For that you don't need any special esc handling.

genki · 2024-02-27T05:39:55Z

@wmertens
Even if we walk the array, we can't distinguish the final 0x007f 0x007f because both come while escaped = false.
For the simplest example:

0x007f 0x007f: encoded even array of 2 bytes 0xD8 0x3F.
0x007f 0x007f: encoded odd array of 1 byte 0x7f.

How to distinguish them?

wmertens

Almost there 😎

wmertens · 2024-02-26T21:59:20Z

packages/qwik/src/core/util/string.unit.ts

+  const packed = packUint8Array(a);
+  // 0xD800-0xDFFF = 2048. 2048/65536 = 0.03125
+  // These are doubled in length, so inflating ratio is about 1.0625
+  expect((packed.length * 2) / a.length).toBeLessThan(1.07);


Shouldn't we also check the after-JSON length? And compare it with a base-64 encoding?

It depends on the encoding of the HTML document.
If it uses UTF-16, our encoder/decoder reaches the best inflation ratio (IR), 106%.
But if it uses UTF-8, the IR worsens to about 147%.

In UTF-8:
1 byte: 0x00-0x7F: IR = 50%, share is 128/65536
2 bytes: 0x80-0x7FF: IR = 100%, share is 1920/65536
3 bytes: 0x800-0xFFFF: IR = 150%, share is (63488 - 2048)/65536 = 61440/65536
4 bytes: 0x10000-0x10FFFF (all surrogate pairs in UTF-16): IR = 100%, share is 2048/65536
So in total, the IR is about 147%
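
The estimate can be reproduced as a weighted average. This is a sketch under the same assumptions as the breakdown above: uniformly random 16-bit values, no escape overhead counted, and each surrogate value charged 2 UTF-8 bytes (half of a 4-byte pair). The range 0x00-0x7F holds 128 values. All names are illustrative.

```typescript
// UTF-8 bytes needed per 16-bit value, and how many values fall in each class.
const utf8Classes = [
  { utf8Bytes: 1, count: 0x80 },                   // 0x00-0x7F
  { utf8Bytes: 2, count: 0x800 - 0x80 },           // 0x80-0x7FF
  { utf8Bytes: 3, count: 0x10000 - 0x800 - 2048 }, // 0x800-0xFFFF minus surrogates
  { utf8Bytes: 2, count: 2048 },                   // surrogates: 4 bytes per pair
];

const totalValues = utf8Classes.reduce((n, c) => n + c.count, 0);

// Each 16-bit value carries 2 payload bytes, so IR per class = utf8Bytes / 2.
const utf8InflationRatio =
  utf8Classes.reduce((sum, c) => sum + (c.utf8Bytes / 2) * c.count, 0) / totalValues;

// base64 emits 4 characters (1 byte each in UTF-8) per 3 input bytes.
const base64InflationRatio = 4 / 3;
```

Under these assumptions `utf8InflationRatio` comes out a little under 1.47, which is above the base64 ratio of about 1.33.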

If Qwik does not support anything other than UTF-8, base64 may be the better choice because UTF-8 is a latin1-friendly encoding. The IR of base64 encoding is 4/3, about 133%.
It may be best to choose the encoder/decoder depending on the encoding of the HTML document: for example, base64 if it is UTF-8 or latin1, otherwise this serializer.

This result comes from UTF-8 not being space efficient for anything other than latin1 characters.
Although UTF-8 is common for HTML, UTF-16 is the better choice if the document contains many characters above the 0x7f code point, because most characters that take 3 bytes in UTF-8 take 2 bytes in UTF-16.

For Qwik in particular, since it uses many serializations that produce special characters, it may be good to set UTF-16 as the default encoding for better performance.

I think Qwik only supports utf-8 at the moment, so then base64 would be a better choice :-/.

However, we have 96 safe-ish characters at our disposal that encode to one byte, so we could implement our own base-96. However, here's a very cool package that does UTF-32 encoding and it talks about other encodings, saying that for UTF-8, base-64 is still preferred :-/.

wmertens · 2024-02-26T22:04:08Z