feat(qwik-core): Uint8Array serializer #5846
Conversation
I like it, but:
- it's missing tests, especially tests for edge cases involving non-printable characters, XSS result strings like `<script`, unicode modifiers etc. You could look at https://github.com/minimaxir/big-list-of-naughty-strings for inspiration of what shouldn't be allowed as a result
- most likely you'll need an escape mechanism?
- the code can be deduplicated by moving the functions out and making them into factories depending on odd/even
👍
@wmertens I see, this has some problems.
@genki don't get me wrong, this could probably be useful, so a solidly tested PR would still be welcome.
@wmertens Thank you. It was just my carelessness.
I have fixed the previous implementation to serialize
Wow, that's a lot of handling needed :) Are you sure that it's impossible for an even sequence to look like an odd one? You can add any test files you like. It's a bit heavy to review for me right now, I'll get back to this later. Notice the lint error btw.
@wmertens
Right, but I meant: what if the even array has the same bytes at the end? Or is that sequence reserved by the escaping?
@wmertens
Note the fake low surrogate.
When encoding the
Added tests and moved utility functions into a separate file.
Fixed the unit test because the
I realized the BOM should also be escaped, to protect against unexpected replacement or removal by text processors. As far as I know, there are no characters in UTF-16 with undefined treatment other than unmatched surrogates and the BOM.
Sorry, fixed the `true`. Yes, about `for of`, see my other comments.
@wmertens I see.
@genki actually no, the low surrogate should be handled in `maybe_escape`.
Isn't it better to handle the unmatched low surrogate as the
@wmertens

```js
for (const s of code.split('')) {
  const c = s.charCodeAt(0);
  if (!escaped) {
    if (c === esc) {
```

Is it acceptable?
Or is it necessary to do it like this?

```js
for (const s of code) {
  const c = s.charCodeAt(0);
  if (!escaped) {
    if (c === esc) {
      escaped = true;
    } else {
      // normal codepoint
      bytes[j++] = c & 0xff;
      bytes[j++] = c >>> 8;
      if (c >= 0xD800 && c <= 0xDBFF) {
        const d = s.charCodeAt(1);
        bytes[j++] = d & 0xff;
        bytes[j++] = d >>> 8;
      }
    }
    continue;
  }
```

At any rate, it incurs extra computing cost. Does this still have a gain over the original implementation?
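For context, a quick illustration (mine, not from the PR) of the difference between the two loops: `split('')` yields UTF-16 code units, so a surrogate pair arrives as two lone surrogates, while `for...of` yields whole code points:

```javascript
// '\u{1F600}' (😀) is one code point encoded as two UTF-16 code units.
const s = '\u{1F600}';

// split('') iterates code units: two lone surrogates.
const units = s.split('');
console.log(units.length); // 2
console.log(units[0].charCodeAt(0).toString(16)); // "d83d" (high surrogate)

// for...of iterates code points: the pair comes through as one 2-unit string.
const points = [];
for (const ch of s) points.push(ch);
console.log(points.length); // 1
console.log(points[0].length); // 2 (still two code units long)
```

This is why the `for...of` version has to read the second code unit with `charCodeAt(1)`.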
Updated the implementation to use
As the treatment of ESC ESC may feel strange, let me explain in advance.
So I chose to change the encoding of the
Well no, not quite. You can't know if an array is odd or even by looking at the end. You have to walk the array, and if you end with For that you don't need any special esc handling.
@wmertens
How to distinguish them? |
Almost there 😎
```js
const packed = packUint8Array(a);
// 0xD800-0xDFFF = 2048. 2048/65536 = 0.03125
// These are doubled in length, so inflating ratio is about 1.0625
expect((packed.length * 2) / a.length).toBeLessThan(1.07);
```
Shouldn't we also check the after-JSON length? And compare it with a base-64 encoding?
It depends on the encoding of the HTML document.
If it uses UTF-16, our encoder/decoder reaches the best inflation ratio (IR), 106%.
But if it uses UTF-8, it drops to about 147%.
In UTF-8:
- 1 byte (0x00-0x7F): IR = 50%, share is 256/65536
- 2 bytes (0x80-0x7FF): IR = 100%, share is 1920/65536
- 3 bytes (0x800-0xFFFF): IR = 150%, share is (63488 - 2048)/65536 = 61440/65536
- 4 bytes (0x10000-0x10FFFF, all surrogate pairs in UTF-16): IR = 100%, share is 2048/65536
So in total, the IR is about 147%.
If Qwik drops support for everything other than UTF-8, base64 may be the better choice because UTF-8 is a latin1-friendly encoding. The IR of base64 encoding is 4/3, about 133%.
It may be best to choose the encoder/decoder depending on the encoding of the HTML document: for example, if it is UTF-8 or latin1, use base64, otherwise this serializer.
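The weighted average quoted above can be sanity-checked with a small sketch (my own illustration, reusing the shares from the comment as given, not re-derived):

```javascript
// IR per bucket = UTF-8 bytes emitted / original bytes packed.
// Buckets 1-3: one code unit carries 2 original bytes.
// Bucket 4: a surrogate pair (2 code units, 4 original bytes) takes 4 UTF-8 bytes.
const buckets = [
  { ir: 0.5, share: 256 },   // 1-byte UTF-8 range, share as quoted above
  { ir: 1.0, share: 1920 },  // 2-byte UTF-8 range
  { ir: 1.5, share: 61440 }, // 3-byte UTF-8 range (minus surrogates)
  { ir: 1.0, share: 2048 },  // surrogate range
];
const ir = buckets.reduce((sum, b) => sum + b.ir * b.share, 0) / 65536;
console.log(ir); // 1.46875, i.e. about 147%
```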
This result comes from the fact that UTF-8 is not space-efficient outside latin1 characters.
Although UTF-8 is common for HTML, UTF-16 is the better choice if the document contains many characters above the 0x7F code point, because most characters that take 3 bytes in UTF-8 take 2 bytes in UTF-16.
In particular for Qwik, which does a lot of serialization that produces special characters, it may be good to set UTF-16 as the default encoding for better performance.
I think Qwik only supports utf-8 at the moment, so then base64 would be a better choice :-/.
However, we have 96 safe-ish characters at our disposal that each encode to one byte, so we could implement our own base-96. That said, here's a very cool package that does UTF-32 encoding, and it talks about other encodings, saying that for UTF-8, base-64 is still preferred :-/.
```js
}
c |= b << 8;
low = true;
if (surrogate !== undefined) {
```
you can just do `if (surrogate)`, it will always be truthy if pending
I see.
```js
// double the escape character
if (c === ESC) {
  code += String.fromCharCode(0x08ff);
}
```
@genki I still think this should send `ESC + 0x08ff`?
Also, I think the code will be nicer (and shorter) if you make a `maybeEscape(code)` function that adds to the output and handles all the escape cases.
```js
if (escaped) {
  // Array is odd-length, remove last byte
  return bytes.subarray(0, j - 1);
} else {
  return bytes.subarray(0, j);
}
```
`return bytes.subarray(0, escaped ? j - 1 : j)`
Yes, this was wrong. I will fix it.
By habit, I prefer not to call a function in a tight inner loop, but that may be old-fashioned.
Do you think the optimizer handles it well?
```js
// the mismatched high or low surrogate with an escaped value.
//
// 0x007f: escape, because it's rare but still only one utf-8 byte.
// To escape itself, use 0x007f 0x08ff (two bytes utf-8)
```
actually this is 4 bytes utf-8
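The byte count can be double-checked with `TextEncoder` (my illustration, assuming the escape sequence is U+007F followed by U+08FF as the comments describe): U+007F encodes to one UTF-8 byte and U+08FF to three, so the pair is four bytes:

```javascript
const enc = new TextEncoder();
console.log(enc.encode('\u007f').length);       // 1 byte (ASCII range)
console.log(enc.encode('\u08ff').length);       // 3 bytes (0x0800-0xFFFF range)
console.log(enc.encode('\u007f\u08ff').length); // 4 bytes total
```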
```js
}
if (!low && bytes.length > 0) {
  // put the last byte
  code += String.fromCharCode(c === ESC ? 0x100 : c, ESC);
```
I really think you don't need this now
BTW, you can see how much this PR adds to the qwik bundle by looking at this line inside the Qwik CI check: For main it's This code will always be shipped, so every byte helps, as it is multiplied by millions of eventual views :)
Your second example is wrong, encoding 0x7f should be
Very interesting: https://blog.kevinalbs.com/base122
Ok so the base-122 encoding assumes embedding into HTML directly, and it is smaller, but after gzip encoding it is larger :-( So it looks like we should just use base64 encode/decode. Too bad, because this was fun code, but the boring way seems to be the most efficient.
Okay, so I will change the implementation to use base64.
Yeah, and w3 discourages using UTF-16 for webpages.
Completed :)
```js
const buf = atob(data);
const array = new Uint8Array(buf.length);
for (let i = 0; i < buf.length; i++) {
  array[i] = buf.charCodeAt(i);
}
return array;
```
Why don't you use TextEncoder here?
Sorry, I recalled that TextEncoder and TextDecoder are not usable for this purpose.
TextDecoder can only decode arrays that encode valid text.
A generic base64 encoder is needed for arbitrary binary data.
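A quick sketch of why (my illustration): `TextDecoder` in its default non-fatal mode substitutes U+FFFD for bytes that aren't valid UTF-8, so arbitrary binary cannot round-trip through it:

```javascript
// 0xff can never appear in valid UTF-8, so the decoder replaces it.
const decoded = new TextDecoder().decode(new Uint8Array([0x68, 0x69, 0xff]));
console.log(decoded); // "hi\uFFFD" - the original 0xff byte is lost
console.log(new TextEncoder().encode(decoded).length); // 5, not 3 (U+FFFD is 3 bytes)
```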
Fixed. Now uses
```js
for (const c of v) {
  buf += String.fromCharCode(c);
}
return btoa(buf);
```
You can make this a tiny bit smaller by doing `btoa(buf).replace(/=+$/, '')`
Indeed. I fixed it.
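For reference (my illustration): per the WHATWG forgiving-base64 rules, `atob` accepts input without trailing `=` padding, which is why stripping it is safe:

```javascript
const full = btoa('hi');                  // "aGk=" (padded)
const stripped = full.replace(/=+$/, ''); // "aGk"
console.log(atob(stripped) === 'hi');     // true: decodes the same without padding
```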
@wmertens
Overview
Serializer for the `Uint8Array`.
What is it?
Description
As the `Uint8Array` can be used for large data blobs, this serializer is compactness-conscious. It encodes the `Uint8Array` into UTF-16 code units, two bytes at a time. To achieve this compactness, there end up being two kinds of `Uint8Array`, even- and odd-length ones, so there are two serializers, `Uint8ArrayESerializer` and `Uint8ArrayOSerializer`, to distinguish them.
Use cases and why
The `Uint8Array` is widely used for data buffers, credential data and so on, compared to other typed arrays.
Fixes #4416
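A minimal sketch of the core idea, packing two bytes per UTF-16 code unit (my own illustration; the real serializer additionally escapes surrogate-range values and the ESC character, and handles odd lengths):

```javascript
// Pack an even-length byte array into a string, little-endian per code unit.
function packPairs(bytes) {
  let code = '';
  for (let i = 0; i < bytes.length; i += 2) {
    code += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
  }
  return code;
}

// Unpack the string back into bytes.
function unpackPairs(code) {
  const bytes = new Uint8Array(code.length * 2);
  for (let i = 0; i < code.length; i++) {
    const c = code.charCodeAt(i);
    bytes[2 * i] = c & 0xff;
    bytes[2 * i + 1] = c >>> 8;
  }
  return bytes;
}

const roundtrip = unpackPairs(packPairs(new Uint8Array([1, 2, 3, 4])));
console.log(Array.from(roundtrip)); // [1, 2, 3, 4]
```

The string is half the length of the original array, which is where the compactness comes from when the page encoding stores a code unit in two bytes.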
Checklist: