Failing to decompress after strFromU8 #16

macobo · 2020-12-08T08:50:01Z

Thanks for the library! Been testing it out and ran into an interesting error:

const { strFromU8, zlibSync, strToU8, unzlibSync } = require('fflate')

const str = 'hi'
let compressed, decompressed

// This works
compressed = zlibSync(strToU8(str))
console.log(compressed)

decompressed = strFromU8(unzlibSync(compressed))
console.log(decompressed)

// This as well
compressed = strFromU8(zlibSync(strToU8(str)), true)
console.log(compressed)

decompressed = strFromU8(unzlibSync(strToU8(compressed, true)))
console.log(decompressed)

// This does not
compressed = strFromU8(zlibSync(strToU8(str)))
console.log(compressed)

decompressed = strFromU8(unzlibSync(strToU8(compressed)))
console.log(decompressed)

The output for version 0.4.1 in node.js is:

⟩ node s.js                                                                
Uint8Array(13) [
  120, 156,   1,   2, 0,
  253, 255, 104, 105, 1,
   59,   0, 210
]
hi
x���ýÿhi�;Ò
hi
x�����hi�;�

/home/macobo/yyy/node_modules/fflate/lib/index.js:887
        throw 'invalid zlib data';
        ^
invalid zlib data
(Use `node --trace-uncaught ...` to show where the exception was thrown)

The same seems to occur for other compression functions as well - tested gzipSync, compressSync. Could there be a problem with strFromU8?

The text was updated successfully, but these errors were encountered:

101arrowz · 2020-12-08T17:48:46Z

Yes, this behavior is known. The issue is that strToU8 works only on UTF-8 strings, and strFromU8 only works on UTF-8 binary data. In fact, it uses the native TextEncoder if available, and if you replace strToU8 with new TextEncoder().encode and strFromU8 with new TextDecoder().decode, you will observe the same issue. This is the reason that the true option exists at all; it's meant to work on data that isn't UTF-8 by decoding character by character instead of handling multibyte Unicode characters. In the case of binary data such as the result of a compression function, you may (read: almost absolutely will) get invalid UTF-8, and that causes failure.

If you'd like to read a bit on what UTF-8 is in binary and how it works under the hood, this RFC may be of interest. Effectively, we need two separate encoders because UTF-8 binary has to be treated differently: individual characters can range from 1 to 4 bytes.

macobo · 2020-12-08T17:56:23Z

Thanks for the clarifications - I arrived at pretty much the exact same conclusion here (not all uint8 arrays constitute valid utf-8 sequences).

My usecase was trying to send this data via xhr - encoding the data via latin1 ended up wasting bytes so where I ended up was along the lines of:

// data is the result of e.g. gzipSync

const req = new XMLHttpRequest()
req.open(method, url, true)

req.send(new Blob([data.buffer], { type: 'text/plain' }))

This seemed to work well on both the client and server side.

Hopefully this helps the next fool trying something similar :)

101arrowz · 2020-12-08T18:03:48Z

Yep, that's why I mentioned that binary strings are inefficient in the docs. Perhaps I could make that more clear. Thanks for the question!

By the way, have you considered using the asynchronous API? Since you're using XHR asynchronously anyway, it shouldn't be too hard to implement and would avoid freezing the browser. Of course, that doesn't apply if what you're uploading is only a few kB, but I'm recommending it just in case.

macobo · 2020-12-08T18:08:40Z

Yeah - the first time through the readme/docs the purpose of latin1 encoding was unclear so it might be worth giving it another pass. That said - great job with the docs + benchmarks, everything was well laid out and made sense.

I was implementing on top of an existing legacy codebase so using the async api sadly wasn't that simple - but hoping to do so as I have cycles to refactor/simplify.

101arrowz · 2020-12-08T18:12:53Z

Thanks for the feedback, I'll try to work on the docs a bit more for that. If you have a specific format in mind (e.g. an interactive website or something) let me know, I'm quite terrible at documentation so any suggestions are appreciated. I hope fflate works well in your codebase!

Nantris · 2022-10-01T23:15:30Z

Is there any way around this without a 200% increase in compressed size? In my testing, the compressed strings seem the same so far in terms of characters, so I'm a bit confused. I think it should probably be documented in the README that this is an issue, even if it's not specific to only fflate.

Thanks for the great work @101arrowz!

101arrowz · 2022-10-01T23:55:06Z

I'm not entirely sure what you're asking, could you post some sample code?

Nantris · 2022-10-02T00:08:08Z

Sorry for the confusion and thanks for the speedy reply @101arrowz!

In the README it says:

binary strings are incredibly inefficient and tend to double file size, so they're not recommended.

But in my testing, using strFromU8(compressionOutput, true) yields the same number of characters as strFromU8(compressionOutput) - so I'm wondering what I'm overlooking/misunderstanding.

And I was thinking the text in this comment might be better served outside of the code sample, as it seems pretty crucial and it's easy to overlook in the codeblock until you're already a ways into trying to implement fflate:

Sorry if any of this doesn't make sense, it's been a long day. Thank you again for your great work!

101arrowz · 2022-10-02T02:18:50Z

But in my testing, using strFromU8(compressionOutput, true) yields the same number of characters as strFromU8(compressionOutput) - so I'm wondering what I'm overlooking/misunderstanding.

Binary strings are really a hack to represent byte arrays with strings. The code points that map to the characters in a binary string correspond to numbers from 0-255 representing the value of the given byte. In other words, binary strings can represent arbitrary binary data.

In contrast, a normal UTF-8 string is just that - a string. It can't represent arbitrary data, only valid Unicode characters.

So basically, if you store/transfer the string you generate with any sort of protocol that encodes strings with Unicode (HTTP, Local Storage, cookies, etc.), a binary string will take more bytes than the actual data it encodes. A binary string is really useful for compatibility with old JS where Uint8Array didn't exist, and binary strings were the only way to represent arbitrary byte sequences. For example btoa and atob operate with binary strings, but if they were to be recreated in the modern JS landscape they'd accept ArrayBuffers instead.

101arrowz closed this as completed Dec 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failing to decompress after strFromU8 #16

Failing to decompress after strFromU8 #16

macobo commented Dec 8, 2020

101arrowz commented Dec 8, 2020 •

edited

macobo commented Dec 8, 2020

101arrowz commented Dec 8, 2020

macobo commented Dec 8, 2020

101arrowz commented Dec 8, 2020

Nantris commented Oct 1, 2022 •

edited

101arrowz commented Oct 1, 2022

Nantris commented Oct 2, 2022 •

edited

101arrowz commented Oct 2, 2022

Failing to decompress after strFromU8 #16

Failing to decompress after strFromU8 #16

Comments

macobo commented Dec 8, 2020

101arrowz commented Dec 8, 2020 • edited

macobo commented Dec 8, 2020

101arrowz commented Dec 8, 2020

macobo commented Dec 8, 2020

101arrowz commented Dec 8, 2020

Nantris commented Oct 1, 2022 • edited

101arrowz commented Oct 1, 2022

Nantris commented Oct 2, 2022 • edited

101arrowz commented Oct 2, 2022

101arrowz commented Dec 8, 2020 •

edited

Nantris commented Oct 1, 2022 •

edited

Nantris commented Oct 2, 2022 •

edited