astral character support in chrome and IE #142

Closed
SheetJSDev opened this issue Jun 14, 2014 · 8 comments · Fixed by #144

@SheetJSDev
Contributor

var zip = new JSZip();
zip.file("Hello.txt", "<si><t>🍣 is ng</t></si>");
var content = zip.generate({type:"blob"});
// see FileSaver.js
saveAs(content, "example.zip");

The character codes in the string are

[60, 115, 105, 62, 60, 116, 62, 55356, 57187, 32, 105, 115, 32, 110, 103, 60, 47, 116, 62, 60, 47, 115, 105, 62]

In Firefox, the string is written correctly, and the equivalent code in Node.js 0.10 also produces correct output. In Chrome, however, the written content is incorrect.
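To make the character codes above easier to read, here is a minimal sketch showing that 55356 and 57187 are the UTF-16 surrogate pair for the astral character U+1F363. The helper names `charCodes` and `codePoints` are hypothetical and are not part of JSZip.

```javascript
// Sample string from the report, with the emoji written as an escaped surrogate pair.
var sample = "<si><t>\uD83C\uDF63 is ng</t></si>";

// UTF-16 code units, exactly as charCodeAt reports them.
function charCodes(str) {
  var codes = [];
  for (var i = 0; i < str.length; i++) codes.push(str.charCodeAt(i));
  return codes;
}

// Hypothetical helper: merge each high/low surrogate pair into one code point.
function codePoints(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < str.length) {
      var d = str.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00);
        i++; // the low surrogate has been consumed
      }
    }
    out.push(c);
  }
  return out;
}

console.log(charCodes(sample));  // 24 code units, including 55356 and 57187
console.log(codePoints(sample)); // 23 code points, including 0x1F363
```

An encoder that treats each code unit as a separate character cannot produce the correct 4-byte UTF-8 sequence for this character, which is one plausible way for the output to go wrong.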

@SheetJSDev
Contributor Author

@uzulla It looks like the original sushi issue stems from this.

@dduponchel
Collaborator

I reproduced the issue. On Firefox and Node.js, the UTF-8 encoding is correct because we use a TextEncoder or Buffer's constructor (and skip the broken implementation). On #141, I replaced our UTF-8 implementation with pako's and that gives correct results. Can you confirm that the files on https://github.com/dduponchel/jszip/tree/async_methods_generated_files/dist fix your issue?
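The encoder selection described above can be sketched roughly as follows. This is an illustration only, not JSZip's or pako's actual code: `utf8Encode` and `utf8EncodeManual` are hypothetical names, and `Buffer.from` is the current Node.js API (code from 2014 would have used the Buffer constructor instead).

```javascript
// Illustrative fallback: a manual UTF-8 encoder that combines surrogate pairs
// before encoding, so astral characters become a single 4-byte sequence.
function utf8EncodeManual(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < str.length) {
      var d = str.charCodeAt(i + 1);
      if (d >= 0xDC00 && d <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00);
        i++;
      }
    }
    if (c < 0x80) {
      bytes.push(c);
    } else if (c < 0x800) {
      bytes.push(0xC0 | (c >> 6), 0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
      // Note: an unpaired surrogate also lands here; platform encoders
      // replace it with U+FFFD instead.
      bytes.push(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
    } else {
      bytes.push(0xF0 | (c >> 18), 0x80 | ((c >> 12) & 0x3F),
                 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
    }
  }
  return bytes;
}

// Prefer a platform encoder when one exists, otherwise use the manual path.
function utf8Encode(str) {
  if (typeof TextEncoder !== "undefined") {
    return Array.prototype.slice.call(new TextEncoder().encode(str));
  }
  if (typeof Buffer !== "undefined") {
    return Array.prototype.slice.call(Buffer.from(str, "utf8"));
  }
  return utf8EncodeManual(str);
}

console.log(utf8EncodeManual("\uD83C\uDF63")); // bytes F0 9F 8D A3 in decimal
```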

@SheetJSDev
Contributor Author

@dduponchel while you are at it, can you rework the crc32 table so that the values are 32 bit integers?

https://github.com/dduponchel/jszip/blob/async_methods_generated_files/dist/jszip.js#L226-L291

For example, the third element of the table is 0xEE0E612C, which is interpreted as 3993919788 (outside the range of values that can be stored in a signed 32-bit integer such as int32_t from stdint).

pako explicitly computes the table in a way that ensures the results are 32-bit integers. If you don't want to compute the table, the simplest solution may be to append |0 to all of the values in the source.
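A minimal sketch of both options, assuming the standard reflected CRC-32 polynomial 0xEDB88320. The function name `makeSignedTable` is hypothetical, and this is not pako's source.

```javascript
// Appending |0 coerces a literal into the signed 32-bit range.
var literal = 0xEE0E612C;    // 3993919788 as a plain number
var coerced = 0xEE0E612C | 0; // -301047508 as a signed 32-bit integer

// Compute the 256-entry table so that every entry is a signed 32-bit integer.
function makeSignedTable() {
  var table = new Array(256);
  for (var n = 0; n < 256; n++) {
    var c = n;
    for (var k = 0; k < 8; k++) {
      // >>> keeps the shift logical; ^ yields a signed 32-bit result
      c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
    }
    table[n] = c | 0;
  }
  return table;
}

var table = makeSignedTable();
console.log(literal, coerced, table[2]);
```

Both forms carry the same 32 bits; `table[2] >>> 0` recovers the unsigned value 0xEE0E612C when it is needed for display.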

@SheetJSDev
Contributor Author

@dduponchel http://sheetjsdev.github.io/js-xlsx-demo/ works in Chrome, FF, and IE.

dduponchel added the bug label Jun 16, 2014
@dduponchel
Collaborator

@SheetJSDev nice!
I've also updated the pull request with a new crc32 implementation.

@SheetJSDev
Contributor Author

@dduponchel based on some performance tests, https://github.com/SheetJS/js-crc32 is at least 8x faster than pako's CRC32

This was referenced Jun 17, 2014
@dduponchel
Collaborator

8x faster? I get the same speed when I test the two implementations with benchmark.js.

Stuk closed this as completed in #144 Jun 18, 2014
@SheetJSDev
Contributor Author

@dduponchel @Stuk At the callsite, crc32 is sometimes called with a string and sometimes with a buffer/array (to verify, insert console.log(typeof input); at the start of the crc32 function and run the js-xlsx test suite). The existing crc32 implementation handles this by calling charCodeAt when the input is a string, but the pako implementation has no such fallback.

If you generate a text file and pass it into jszip, which is what happens in the overwhelming majority of cases with js-xlsx, jszip would have to convert the utf8 string to a buffer in order to proceed with pako. On the other hand, js-crc32 provides specialized functions for these cases (binary string / unicode characters / array+buffer).
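As a rough illustration of the string-versus-array dispatch discussed above, here is a self-contained sketch of a table-driven CRC-32 that accepts either a binary string or an array of bytes. It is not the jszip, pako, or js-crc32 source; `makeTable` and `crc32` are names chosen for this example.

```javascript
// Standard reflected CRC-32 table (polynomial 0xEDB88320), signed 32-bit entries.
function makeTable() {
  var table = new Array(256);
  for (var n = 0; n < 256; n++) {
    var c = n;
    for (var k = 0; k < 8; k++) {
      c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
    }
    table[n] = c | 0;
  }
  return table;
}

var TABLE = makeTable();

// Accepts a binary string (char codes 0-255) or an array/Buffer of bytes.
// A string holding characters above 255 must be UTF-8 encoded first.
function crc32(input) {
  var isString = typeof input === "string";
  var crc = -1; // 0xFFFFFFFF as a signed 32-bit integer
  for (var i = 0; i < input.length; i++) {
    var byte = isString ? (input.charCodeAt(i) & 0xFF) : input[i];
    crc = (crc >>> 8) ^ TABLE[(crc ^ byte) & 0xFF];
  }
  return (crc ^ -1) >>> 0; // final XOR, returned as an unsigned value
}

console.log(crc32("123456789").toString(16));
console.log(crc32([49, 50, 51, 52, 53, 54, 55, 56, 57]).toString(16));
```

Checking the type once outside the loop keeps the per-byte work small, but a type test still runs on every iteration here; separate functions per input type, which is the approach described for js-crc32, would avoid that branch.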

Here are the most frequently accessed callsite and the string generation location.

For some real-life perspective, the median file size is 550 characters (there are very small files like [Content_Types].xml and the various OPC relationship files). Nevertheless, here are some tests using benchmark.js and results under node 0.10.29 for various sizes up to 64 MB (for some reason, node 0.11.13 was giving FATAL ERROR: CALL_AND_RETRY_0 Allocation failed - process out of memory errors above that). This is a modified version of the existing js-crc32 performance test looking at strings and node Buffers: https://gist.github.com/SheetJSDev/a4b4a623a86b82fab08b#file-perf-txt (the test scripts are included in the gist)

The numbers look a little bit different in 0.11.13, but the overall performance is absolutely horrendous. For example, take the case of 255 characters:

### node 0.10.29
--- buffer (255) ---
✓ js-crc32   x 30.60 ops/sec ±0.22% (54 runs sampled)
✓ pako-crc32 x 18.33 ops/sec ±0.42% (48 runs sampled)
Fastest is js-crc32
--- unicode string (255) ---
✓ js-crc32   x 11.55 ops/sec ±0.20% (32 runs sampled)
✓ pako-crc32 x 1.93 ops/sec ±3.39% (7 runs sampled)
Fastest is js-crc32

### node 0.11.13
--- buffer (255) ---
✓ js-crc32   x 2.75 ops/sec ±0.37% (13 runs sampled)
✓ pako-crc32 x 1.69 ops/sec ±0.86% (8 runs sampled)
Fastest is js-crc32  ,pako-crc32
--- unicode string (255) ---
✓ js-crc32   x 16.65 ops/sec ±0.47% (45 runs sampled)
✓ pako-crc32 x 0.51 ops/sec ±2.78% (5 runs sampled)
Fastest is js-crc32

P.S.: I think the 0.11.13 regression is related to nodejs/node-v0.x-archive#7633.
