New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

astral character support in chrome and IE #142

Closed
SheetJSDev opened this Issue Jun 14, 2014 · 8 comments

Comments

Projects
None yet
2 participants
@SheetJSDev
Contributor

SheetJSDev commented Jun 14, 2014

var zip = new JSZip();
zip.file("Hello.txt", "<si><t>馃崳 is ng</t></si>");
var content = zip.generate({type:"blob"});
// see FileSaver.js
saveAs(content, "example.zip");

The character codes in the string are

[60, 115, 105, 62, 60, 116, 62, 55356, 57187, 32, 105, 115, 32, 110, 103, 60, 47, 116, 62, 60, 47, 115, 105, 62]

In firefox, the string is properly written. The equivalent code in nodejs 0.10 is correct. However, in chrome, the content is not correct.

@SheetJSDev

This comment has been minimized.

Show comment
Hide comment
@SheetJSDev

SheetJSDev Jun 14, 2014

Contributor

@uzulla It looks like the original sushi issue stems from this.

Contributor

SheetJSDev commented Jun 14, 2014

@uzulla It looks like the original sushi issue stems from this.

@dduponchel

This comment has been minimized.

Show comment
Hide comment
@dduponchel

dduponchel Jun 14, 2014

Collaborator

I reproduced the issue. On Firefox and Nodejs, the utf8 encoding is correct because we use a TextEncoder or Buffer's constructor (and skip the broken implementation). On #141, I replaced our utf8 implementation with pako's and that gives correct results. Can you confirm that the files on https://github.com/dduponchel/jszip/tree/async_methods_generated_files/dist fix your issue ?

Collaborator

dduponchel commented Jun 14, 2014

I reproduced the issue. On Firefox and Nodejs, the utf8 encoding is correct because we use a TextEncoder or Buffer's constructor (and skip the broken implementation). On #141, I replaced our utf8 implementation with pako's and that gives correct results. Can you confirm that the files on https://github.com/dduponchel/jszip/tree/async_methods_generated_files/dist fix your issue ?

@SheetJSDev

This comment has been minimized.

Show comment
Hide comment
@SheetJSDev

SheetJSDev Jun 14, 2014

Contributor

@dduponchel while you are at it, can you rework the crc32 table so that the values are 32 bit integers?

https://github.com/dduponchel/jszip/blob/async_methods_generated_files/dist/jszip.js#L226-L291

For example, the third element of the table is 0xEE0E612C, which is interpreted as 3993919788 (outside the set of values that can be stored in a stdint int32_t).

pako explicitly computes the table in a way that ensures the results are 32 bit integers. If you don't want to compute the table, the simplest solution may be to append |0 to all of the values in the source

Contributor

SheetJSDev commented Jun 14, 2014

@dduponchel while you are at it, can you rework the crc32 table so that the values are 32 bit integers?

https://github.com/dduponchel/jszip/blob/async_methods_generated_files/dist/jszip.js#L226-L291

For example, the third element of the table is 0xEE0E612C, which is interpreted as 3993919788 (outside the set of values that can be stored in a stdint int32_t).

pako explicitly computes the table in a way that ensures the results are 32 bit integers. If you don't want to compute the table, the simplest solution may be to append |0 to all of the values in the source

@SheetJSDev

This comment has been minimized.

Show comment
Hide comment
@SheetJSDev
Contributor

SheetJSDev commented Jun 14, 2014

@dduponchel dduponchel added the bug label Jun 16, 2014

@dduponchel

This comment has been minimized.

Show comment
Hide comment
@dduponchel

dduponchel Jun 16, 2014

Collaborator

@SheetJSDev nice !
I've also updated the pull request with a new crc32 implementation.

Collaborator

dduponchel commented Jun 16, 2014

@SheetJSDev nice !
I've also updated the pull request with a new crc32 implementation.

@SheetJSDev

This comment has been minimized.

Show comment
Hide comment
@SheetJSDev

SheetJSDev Jun 16, 2014

Contributor

@dduponchel based on some performance tests, https://github.com/SheetJS/js-crc32 is at least 8x faster than pako's CRC32

Contributor

SheetJSDev commented Jun 16, 2014

@dduponchel based on some performance tests, https://github.com/SheetJS/js-crc32 is at least 8x faster than pako's CRC32

This was referenced Jun 17, 2014

@dduponchel

This comment has been minimized.

Show comment
Hide comment
@dduponchel

dduponchel Jun 18, 2014

Collaborator

8x faster ? I get the same speed when I test the two implementations with benchmark.js.

Collaborator

dduponchel commented Jun 18, 2014

8x faster ? I get the same speed when I test the two implementations with benchmark.js.

@Stuk Stuk closed this in #144 Jun 18, 2014

@SheetJSDev

This comment has been minimized.

Show comment
Hide comment
@SheetJSDev

SheetJSDev Jun 18, 2014

Contributor

@dduponchel @Stuk At the callsite, crc32 is sometimes called as string and sometimes called with a buffer/array (to verify, insert console.log(typeof input); at the start of the crc32 function and run the js-xlsx test suite). The existing crc32 implementation addresses the issue by optionally calling charCodeAt, but the pako implementation has no such callback.

If you generate a text file and pass it into jszip, which is what happens in the overwhelming majority of cases with js-xlsx, jszip would have to convert the utf8 string to a buffer in order to proceed with pako. On the other hand, js-crc32 provides specialized functions for these cases (binary string / unicode characters / array+buffer).

Here is the most frequently access callsite and string generation location.

For some real-life perspective, the median file size is 550 characters (there are very small files like [Content_Types].xml and the various OPC relationship files). Nevertheless, here are some tests using benchmark.js and results under node 0.10.29 for various sizes up to 64 MB (for some reason, node 0.11.13 was giving FATAL ERROR: CALL_AND_RETRY_0 Allocation failed - process out of memory errors above that). This is a modified version of the existing js-crc32 performance test looking at strings and node Buffers: https://gist.github.com/SheetJSDev/a4b4a623a86b82fab08b#file-perf-txt (the test scripts are included in the gist)

The numbers look a little bit different in 0.11.13, but the overall performance is absolutely horrendous. For example, take the case of 255 characters:

### node 0.10.29
+--- buffer (255) ---
+鉁 js-crc32   x 30.60 ops/sec 卤0.22% (54 runs sampled)
+鉁 pako-crc32 x 18.33 ops/sec 卤0.42% (48 runs sampled)
+Fastest is js-crc32
+--- unicode string (255) ---
+鉁 js-crc32   x 11.55 ops/sec 卤0.20% (32 runs sampled)
+鉁 pako-crc32 x 1.93 ops/sec 卤3.39% (7 runs sampled)
Fastest is js-crc32

### node 0.11.13
--- buffer (255) ---
鉁 js-crc32   x 2.75 ops/sec 卤0.37% (13 runs sampled)
鉁 pako-crc32 x 1.69 ops/sec 卤0.86% (8 runs sampled)
Fastest is js-crc32  ,pako-crc32
--- unicode string (255) ---
鉁 js-crc32   x 16.65 ops/sec 卤0.47% (45 runs sampled)
鉁 pako-crc32 x 0.51 ops/sec 卤2.78% (5 runs sampled)
Fastest is js-crc32

P.S.: I think the 0.11.13 regression is related to nodejs/node-v0.x-archive#7633.

Contributor

SheetJSDev commented Jun 18, 2014

@dduponchel @Stuk At the callsite, crc32 is sometimes called as string and sometimes called with a buffer/array (to verify, insert console.log(typeof input); at the start of the crc32 function and run the js-xlsx test suite). The existing crc32 implementation addresses the issue by optionally calling charCodeAt, but the pako implementation has no such callback.

If you generate a text file and pass it into jszip, which is what happens in the overwhelming majority of cases with js-xlsx, jszip would have to convert the utf8 string to a buffer in order to proceed with pako. On the other hand, js-crc32 provides specialized functions for these cases (binary string / unicode characters / array+buffer).

Here is the most frequently access callsite and string generation location.

For some real-life perspective, the median file size is 550 characters (there are very small files like [Content_Types].xml and the various OPC relationship files). Nevertheless, here are some tests using benchmark.js and results under node 0.10.29 for various sizes up to 64 MB (for some reason, node 0.11.13 was giving FATAL ERROR: CALL_AND_RETRY_0 Allocation failed - process out of memory errors above that). This is a modified version of the existing js-crc32 performance test looking at strings and node Buffers: https://gist.github.com/SheetJSDev/a4b4a623a86b82fab08b#file-perf-txt (the test scripts are included in the gist)

The numbers look a little bit different in 0.11.13, but the overall performance is absolutely horrendous. For example, take the case of 255 characters:

### node 0.10.29
+--- buffer (255) ---
+鉁 js-crc32   x 30.60 ops/sec 卤0.22% (54 runs sampled)
+鉁 pako-crc32 x 18.33 ops/sec 卤0.42% (48 runs sampled)
+Fastest is js-crc32
+--- unicode string (255) ---
+鉁 js-crc32   x 11.55 ops/sec 卤0.20% (32 runs sampled)
+鉁 pako-crc32 x 1.93 ops/sec 卤3.39% (7 runs sampled)
Fastest is js-crc32

### node 0.11.13
--- buffer (255) ---
鉁 js-crc32   x 2.75 ops/sec 卤0.37% (13 runs sampled)
鉁 pako-crc32 x 1.69 ops/sec 卤0.86% (8 runs sampled)
Fastest is js-crc32  ,pako-crc32
--- unicode string (255) ---
鉁 js-crc32   x 16.65 ops/sec 卤0.47% (45 runs sampled)
鉁 pako-crc32 x 0.51 ops/sec 卤2.78% (5 runs sampled)
Fastest is js-crc32

P.S.: I think the 0.11.13 regression is related to nodejs/node-v0.x-archive#7633.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment