Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing to decompress after strFromU8 #16

Closed
macobo opened this issue Dec 8, 2020 · 9 comments
Closed

Failing to decompress after strFromU8 #16

macobo opened this issue Dec 8, 2020 · 9 comments

Comments

@macobo
Copy link

macobo commented Dec 8, 2020

Thanks for the library! Been testing it out and ran into an interesting error:

const { strFromU8, zlibSync, strToU8, unzlibSync } = require('fflate')

const str = 'hi'
let compressed, decompressed

// This works
compressed = zlibSync(strToU8(str))
console.log(compressed)

decompressed = strFromU8(unzlibSync(compressed))
console.log(decompressed)

// This as well
compressed = strFromU8(zlibSync(strToU8(str)), true)
console.log(compressed)

decompressed = strFromU8(unzlibSync(strToU8(compressed, true)))
console.log(decompressed)

// This does not
compressed = strFromU8(zlibSync(strToU8(str)))
console.log(compressed)

decompressed = strFromU8(unzlibSync(strToU8(compressed)))
console.log(decompressed)

The output for version 0.4.1 in node.js is:

⟩ node s.js                                                                
Uint8Array(13) [
  120, 156,   1,   2, 0,
  253, 255, 104, 105, 1,
   59,   0, 210
]
hi
x���ýÿhi�;Ò
hi
x�����hi�;�

/home/macobo/yyy/node_modules/fflate/lib/index.js:887
        throw 'invalid zlib data';
        ^
invalid zlib data
(Use `node --trace-uncaught ...` to show where the exception was thrown)

The same seems to occur for other compression functions as well - tested gzipSync, compressSync. Could there be a problem with strFromU8?

@101arrowz
Copy link
Owner

101arrowz commented Dec 8, 2020

Yes, this behavior is known. The issue is that strToU8 works only on UTF-8 strings, and strFromU8 only works on UTF-8 binary data. In fact, it uses the native TextEncoder if available, and if you replace strToU8 with new TextEncoder().encode and strFromU8 with new TextDecoder().decode, you will observe the same issue. This is the reason that the true option exists at all; it's meant to work on data that isn't UTF-8 by decoding character by character instead of handling multibyte Unicode characters. In the case of binary data such as the result of a compression function, you may (read: almost absolutely will) get invalid UTF-8, and that causes failure.

If you'd like to read a bit on what UTF-8 is in binary and how it works under the hood, this RFC may be of interest. Effectively, we need two separate encoders because UTF-8 binary has to be treated differently: individual characters can range from 1 to 4 bytes.

@macobo
Copy link
Author

macobo commented Dec 8, 2020

Thanks for the clarifications - I arrived at pretty much the exact same conclusion here (not all uint8 arrays constitute valid utf-8 sequences).

My usecase was trying to send this data via xhr - encoding the data via latin1 ended up wasting bytes so where I ended up was along the lines of:

// data is the result of e.g. gzipSync

const req = new XMLHttpRequest()
req.open(method, url, true)

req.send(new Blob([data.buffer], { type: 'text/plain' }))

This seemed to work well on both the client and server side.

Hopefully this helps the next fool trying something similar :)

@101arrowz
Copy link
Owner

Yep, that's why I mentioned that binary strings are inefficient in the docs. Perhaps I could make that more clear. Thanks for the question!

By the way, have you considered using the asynchronous API? Since you're using XHR asynchronously anyway, it shouldn't be too hard to implement and would avoid freezing the browser. Of course, that doesn't apply if what you're uploading is only a few kB, but I'm recommending it just in case.

@macobo
Copy link
Author

macobo commented Dec 8, 2020

Yeah - the first time through the readme/docs the purpose of latin1 encoding was unclear so it might be worth giving it another pass. That said - great job with the docs + benchmarks, everything was well laid out and made sense.

I was implementing on top of an existing legacy codebase so using the async api sadly wasn't that simple - but hoping to do so as I have cycles to refactor/simplify.

@101arrowz
Copy link
Owner

Thanks for the feedback, I'll try to work on the docs a bit more for that. If you have a specific format in mind (e.g. an interactive website or something) let me know, I'm quite terrible at documentation so any suggestions are appreciated. I hope fflate works well in your codebase!

@Nantris
Copy link

Nantris commented Oct 1, 2022

Is there any way around this without a 200% increase in compressed size? In my testing, the compressed strings seem the same so far in terms of characters, so I'm a bit confused. I think it should probably be documented in the README that this is an issue, even if it's not specific to only fflate.

Thanks for the great work @101arrowz!

@101arrowz
Copy link
Owner

I'm not entirely sure what you're asking, could you post some sample code?

@Nantris
Copy link

Nantris commented Oct 2, 2022

Sorry for the confusion and thanks for the speedy reply @101arrowz!

In the README it says:

binary strings are incredibly inefficient and tend to double file size, so they're not recommended.

But in my testing, using strFromU8(compressionOutput, true) yields the same number of characters as strFromU8(compressionOutput) - so I'm wondering what I'm overlooking/misunderstanding.


And I was thinking the text in this comment might be better served outside of the code sample, as it seems pretty crucial and it's easy to overlook in the codeblock until you're already a ways into trying to implement fflate:


image

Sorry if any of this doesn't make sense, it's been a long day. Thank you again for your great work!

@101arrowz
Copy link
Owner

But in my testing, using strFromU8(compressionOutput, true) yields the same number of characters as strFromU8(compressionOutput) - so I'm wondering what I'm overlooking/misunderstanding.

Binary strings are really a hack to represent byte arrays with strings. The code points that map to the characters in a binary string correspond to numbers from 0-255 representing the value of the given byte. In other words, binary strings can represent arbitrary binary data.

In contrast, a normal UTF-8 string is just that - a string. It can't represent arbitrary data, only valid Unicode characters.

So basically, if you store/transfer the string you generate with any sort of protocol that encodes strings with Unicode (HTTP, Local Storage, cookies, etc.), a binary string will take more bytes than the actual data it encodes. A binary string is really useful for compatibility with old JS where Uint8Array didn't exist, and binary strings were the only way to represent arbitrary byte sequences. For example btoa and atob operate with binary strings, but if they were to be recreated in the modern JS landscape they'd accept ArrayBuffers instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants