async / stream support #195

Merged
merged 33 commits into from Sep 13, 2015

Projects

None yet

5 participants

@dduponchel
Collaborator

This pull request replaces #141.
The current API is synchronous : if JSZip takes too much time to finish its task, the page crashes
(it freezes during the task anyway). This commit does a the following :

  • rewrite the code into workers which are asynchronous
  • replace the public methods
  • add nodejs stream support
  • break the compatibility with existing code

Note : I created a branch jszip_v3 which I target : a change this big can't go
to the current v2.

Workers

A worker is like a nodejs stream but with some differences. On the good side :

  • it works on IE 6-9 without any issue / polyfill
  • it weights less than the full dependencies bundled with browserify
  • it forwards errors (no need to declare an error handler EVERYWHERE)

On the bad side :
It adds new code to maintain, fix, optimize.

A chunk is an object with 2 attributes : meta and data. The former is an
object containing anything (percent for example), see each worker for more
details. The latter is the real data (String, Uint8Array, etc).

Public API

I've updated my gh-pages with the latest API documentation.

Getters are replaced by nodeStream(type) and async(type), generate() is replaced by generateNodeStream and generateAsync. The "async" methods return a Promise, the "stream" methods return a Stream3 (from nodejs) if available.
They have an update callback to get access to metadata like the
name of the current file being compressed or the operation progress (with percent).

With this pull request, the API will be :

// generate
zip.generateAsync(...) // return a Promise of the content
zip.generateNodeStream(...) // return the node stream
zip.generateInternalStream(...) // return the internal stream to let the user write glue code

// getters
zip.async("arraybuffer") // return a Promise
zip.nodeStream("arraybuffer") // return a nodejs stream
zip.internalStream(...) // return the internal stream to let the user write glue code

The previous sync methods now throw exceptions.

Nodejs stream support

file(name, data) now accepts a nodejs stream as data. The "nodeStream" methods generates streams.

Breaking changes

The undocumented JSZip.compressions object changed : the object now returns
workers to do the job, the previous methods are not used anymore.

A lot of sync methods has been removed.

dduponchel added some commits Feb 8, 2014
@dduponchel dduponchel allow the JIT compilation of JSZip.base64
The index went too far, causing the optimizer to drop the compiled code.
f4744b8
@dduponchel dduponchel function arrayLikeToString : allow JIT
Because of the try/catch, the JIT compiler can't do its magic.
This patch moves some of the hot code into separate functions : those
functions can be compiled, the try/catch remains in the main loop.
398b648
@dduponchel dduponchel the filename is already read in the local part 7282f0d
@dduponchel dduponchel unicode path/comment : skip a step
When loading a zip file with unicode path/comment, we were combining two costly
operations : read data as (binary) string and then convert the binary string
to a real string. This patch skip the first string conversion.
69c42cc
@dduponchel dduponchel ZipEntry : avoid StringReader if possible 08d15f4
@dduponchel dduponchel optimize the call to file(name)
Optimize the call to file(name) : instead of using filter (in O(n)),
fetch the entry directly (in O(1)).
ba43b9c
@dduponchel dduponchel remove an useless string conversion a031847
@dduponchel dduponchel extract CompressedObject and simplify its code
The previous code used a closure to keep a reference to the compressed
content and pointers to the beginning/end of the file in this content.
Then we had some complicated code to extract the content and handle the
corner cases.
This new version adds two new ("private" with a _) methods to the
ZipObject instances : _compress and _decompress. A ZipObject instance
will have two states : compressed (with a CompressedObject) and
decompressed (with a string, uint8array, etc).
With this patch, no more closure : for each file we create a
sub{string,array} for the data and then create a CompressedObject.
eb73812
@dduponchel dduponchel Inline function call. 68cae7d
@dduponchel dduponchel read signature : don't call readString if possible
The transformation uint8array -> string (to check against the signature)
can be costly if repeated for a lot of entries. This commit uses the
reader to check the signature, speeding things with a Uint8ArrayReader.
1e1a5e8
@dduponchel dduponchel reading zip : skip useless work 228d3c2
@dduponchel dduponchel refactor : create writer/ and reader/ 736df65
@dduponchel dduponchel extra fields : pre-calculate end position 2de59af
@dduponchel dduponchel rework a bit stringifyByChunk 30360ac
@dduponchel dduponchel Rework ZipObject 4d4870e
@dduponchel dduponchel Simplify a bit the `generate()` final conversion 71861e0
@dduponchel dduponchel refactor : move the generate code to generate.js
The separation between the generation of the CompressedObjects and the
actual use of them will help for the next commit, adding generateAsync().
6fa2c9c
@dduponchel dduponchel Move logic into fileAdd.
The "content is empty" check can be done at the insertion of the content
and not when compressing it.
394712e
@dduponchel dduponchel replace the crc32 implementation with pako's d8ab178
@dduponchel dduponchel Rewrite code into workers
This commit addresses the timeout issue. The current API is synchronous : if
JSZip takes too much time to finish its task, the page crashes
(it freezes during the task anyway). This commit does a the following :

- rewrite the code into workers which can be asynchronous
- add the needed public methods
- add nodejs stream support
- break the compatibility with existing code

Workers
-------

A worker is like a nodejs stream but with some differences. On the good side :

- it works on IE 6-9 without any issue / polyfill
- it weights less than the full dependencies bundled with browserify
- it forwards errors (no need to declare an error handler EVERYWHERE)

On the bad side :
To get sync AND async methods on the public API without duplicating a lot of
code, this class has `isSync` attribute and some if/then to choose between
doing stuff now, or using an async callback. It is dangerously close to
releasing Zalgo (see http://blog.izs.me/post/59142742143/designing-apis-for-asynchrony for more).

A chunk is an object with 2 attributes : `meta` and `data`. The former is an
object containing anything (`percent` for example), see each worker for more
details. The latter is the real data (String, Uint8Array, etc).

Public API
----------

Each method generating data (generate, asText, etc) gain a stream sibling :
generateStream, asTextStream, etc.
This will need a solid discussion because I'm not really satified with this.

Nodejs stream support
---------------------

With this commit, `file(name, data)` accepts a nodejs stream as data. It also
adds a `asNodejsStream()` on the StreamHelper.

Breaking changes
----------------

The undocumented JSZip.compressions object changed : the object now returns
workers to do the job, the previous methods are not used anymore.

Not broken yet, but the the `checkCRC32` (when loading a zip file, it
synchronously check the crc32 of every files) will need to be replaced.
0ceb14c
@dduponchel dduponchel Update the Saucelabs dependencies / configuration
This commit updates the saucelabs dependencies to the latest and update
the configuration :

- wait up to 10minutes, IE 6 is slow
- launch max 3 parallel jobs
- add IE6, IE11, safari 7, safari 8
570ed22
@dduponchel dduponchel Update browserify
Grunt-browserify adde in v3.2.0 a `banner` option, use it. Also ignore
the lib/nodejs/* files : they only make sense with streams.
0efa846
This was referenced Feb 17, 2015
@dduponchel dduponchel Fix the nodejs bridge.
When paused, a nodejs stream CAN and WILL emit "end" events.
b4a8b2b
@dduponchel
Collaborator

Keeping the sync methods will be too much troubles, I'll remove them.

That would give

zip.file("content.txt").async("uint8array", function onComplete(){...});
zip.file("content.txt").async({
    type: "uint8array",
    onUpdate: function () {...} // allow progress update notifications
}, function onComplete(){...});

zip.file("content.txt").stream("uint8array").on("data", ...); // our simple stream
zip.file("content.txt").nodeStream("uint8array").on("data", ...); // nodejs stream

zip.generateAsync({...}, function onComplete(){...});
zip.generateStream({...}).on("data", ...);
zip.generateNodeStream({...}).on("data", ...);

The old methods will throw an exception like "This method has been removed in JSZip 3.0, please check the upgrade guide.".

I'll remove the synchronous .load() method too : a sync checkCRC32 option isn't possible anymore and an async method allows more things (loading from a Blob, from a nodejs file descriptor, etc).

For the async operations, ES6 Promises look like the right choice (instead of manually adding a callback) but that means adding more bytes in the build.

@Stuk Does this sound reasonable ?

@andrewvarga

Is this going to be part of the stable release some time? (Or any other async (webworker) support?

@dduponchel
Collaborator

I'm still working on it... When I have enough time. I may have some time this weekend to continue working on the async load.

@andrewvarga

Let me know if I can help, I could actually make it work in a webworker really easily by just adding some wrapper methods on top of jszip.min.js like:

self.onmessage = function(message) {
    var messageData = message.data;
    var command = messageData.command;
    var args = messageData.args;
    var requestId = messageData.requestId;

    switch (command) {
        case "load":
                   jsZip = new JSZip();
            jsZip.load(args[0]);

            self.postMessage({
                requestId: requestId
            });
                       break;
        case "parse":
           // ... etc

This way I didn't need to touch the actual JSZIP code at all, it just allows to call its functions in a webworker.

dduponchel added some commits Apr 5, 2015
@dduponchel dduponchel [breaking change] Remove sync methods, part 1.
This commit removes most of the sync methods (everything but the load
method).
The new getters are :

- zip.async(type) : es6 promises
- zip.stream(type) : nodejs Stream3
- zip.internalStream(type) : our internal stream (useful to create wrappers).

The new generate methods are :
- zip.generateAsync() : es6 promises
- zip.generateStream() : nodejs Stream3
- zip.generateInternalStream() : our internal stream

With the "stream" keyword, everyone will think of the nodejs stream :
it's safer to use "internalStream" for our own.
The old sync versions will throw an exception.

Also fix the stream version we use : we use Stream3, not the stream
offered by the current node runtime.

Also changed in generate : prefer "binarystring" over "string".
That way, JSZip#generate will be consistent with ZipObject#async.
"string" is still supported because a zip as a text don't make sense.
0e79bc1
@dduponchel dduponchel [breaking change] Remove sync methods, part 2.
This commit remove the sync "load()" method. It has been replaced by
"loadAsync()" which returns a Promise.
63635e1
@dduponchel
Collaborator

I've updated the pull request with async methods and I've updated the description. I've updated my gh-pages with the latest API documentation.
The next step will be to merge this with master.

@andrewvarga doesn't importScripts solve this issue ? Making JSZip Worker aware may give a strange API.

@balaclark

Looks great, async is very necessary for large files.

dduponchel added some commits Jul 6, 2015
@dduponchel dduponchel Merge tag 'v2.5.0' into async
JSZip 2.5.0
b9aca03
@dduponchel dduponchel Use setImmediate instead of setTimeout.
setTimeout has 4ms minimum delay while on a recent browser a chunk takes
~0.2ms of processing. removing these 4ms make a huge difference.

Main downside: the polyfill globally declares setImmediate.
954791e
@dduponchel
Collaborator

Up-to-date with v2.5.0 (I moved the branch jszip_v3 to this tag). @Stuk are you ok with this pull request ?

dduponchel added some commits Aug 2, 2015
@dduponchel dduponchel Prefix nodejs stream methods with "node".
Browsers are also adding streams, see [1], [2], [3]. While the spirit is
the same, the API is different from nodejs streams. To avoid future
confusion, I prefer renaming `generateStream` to `generateNodeStream`
and `stream` to `nodeStream` (we already have `nodebuffer` as type).

  [0]: https://streams.spec.whatwg.org/
  [1]: https://blog.wanderview.com/blog/2015/06/19/intent-to-implement-streams-in-firefox/
  [2]: https://www.chromestatus.com/features/5804334163951616
4a066ad
@dduponchel dduponchel Add "text" as an alias of "string".
When I migrated the tests from `asText()` to `async("string")`, I typed
a lot of `async("text")`. I strongly suspect that other users will have
the issue too while migrating.
944d981
@dduponchel dduponchel Fix upgrade_guide example. aa035eb
@dduponchel
Collaborator

I've prefixed stream methods with "node": browsers are also adding streams, see 1, 2, 3. While the spirit is the same, the API is different from nodejs streams. To avoid future confusion, I prefer renaming generateStream to generateNodeStream and stream to nodeStream (we already have nodebuffer as type).

@Stuk Stuk and 1 other commented on an outdated diff Aug 5, 2015
lib/generate.js
+ date : date,
+ comment : file.comment || "",
+ unixPermissions : file.unixPermissions,
+ dosPermissions : file.dosPermissions
+ })
+ .pipe(zipFileWorker);
+ }
+ zipFileWorker.entriesCount = entriesCount;
+ } catch (e) {
+ zipFileWorker.error(e);
+ }
+
+ return zipFileWorker;
+};
+
+/**
@Stuk
Stuk Aug 5, 2015 Owner

Should ZipFileWorker be put in its own file?

@dduponchel
dduponchel Aug 6, 2015 Collaborator

Yes.

@dduponchel
dduponchel Sep 2, 2015 Collaborator

Done.

@Stuk Stuk commented on the diff Aug 5, 2015
lib/nodejs/NodejsStreamInputAdapter.js
+utils.inherits(NodejsStreamInputAdapter, GenericWorker);
+
+/**
+ * Prepare the stream and bind the callbacks on it.
+ * Do this ASAP on node 0.10 ! A lazy binding doesn't always work.
+ * @param {Stream} stream the nodejs stream to use.
+ */
+NodejsStreamInputAdapter.prototype._bindStream = function (stream) {
+ var self = this;
+ this._stream = stream;
+ stream.pause();
+ stream
+ .on("data", function (chunk) {
+ self.push({
+ data: chunk,
+ meta : {
@Stuk
Stuk Aug 5, 2015 Owner

Is this correct? It will always be 0 percent?

@dduponchel
dduponchel Aug 6, 2015 Collaborator

We don't know the total content length that will be emitted by the nodejs stream.
When generating a zip file from streams, the update function of generateAsync() will get abrupt changes in the total percent:

{ currentFile: 'dist/', percent: 0 }
{ currentFile: 'dist/jszip.js', percent: 33.333333333333336 }
[...]
{ currentFile: 'dist/jszip.js', percent: 33.333333333333336 }
{ currentFile: 'dist/jszip.min.js', percent: 66.66666666666667 }
[...]
{ currentFile: 'dist/jszip.min.js', percent: 66.66666666666667 }
{ currentFile: null, percent: 100 }
[...]
{ currentFile: null, percent: 100 }

We get a better "curve" if we have the full content in memory (nodejs Buffer, ArrayBuffer, etc):

{ currentFile: 'dist/', percent: 0 }
{ currentFile: 'dist/jszip.js', percent: 33.333333333333336 }
{ currentFile: 'dist/jszip.js', percent: 34.813265189986616 }
{ currentFile: 'dist/jszip.js', percent: 36.2931970466399 }
[...]
{ currentFile: 'dist/jszip.js', percent: 64.41190232305222 }
{ currentFile: 'dist/jszip.js', percent: 65.8918341797055 }
[...]
{ currentFile: 'dist/jszip.min.js', percent: 94.72415163097831 }
{ currentFile: 'dist/jszip.min.js', percent: 100 }
{ currentFile: null, percent: 100 }
[...]
{ currentFile: null, percent: 100 }

The result is not perfect with streams but it seems good enough.

@Stuk Stuk and 1 other commented on an outdated diff Aug 5, 2015
lib/stream/GenericWorker.js
+ * Add a callback on an event.
+ * @param {String} name the name of the event (data, end, error)
+ * @param {Function} listener the function to call when the event is triggered
+ * @return {GenericWorker} the current object for chainability
+ */
+ on : function (name, listener) {
+ this._listeners[name].push(listener);
+ return this;
+ },
+ /**
+ * Abort the current worker and the upstream workers because of a write
+ * failure (something went wrong downstream and the whole chain explodes).
+ */
+ writeError : function () {
+ this.isFinished = true;
+ this.cleanUp();
@Stuk
Stuk Aug 5, 2015 Owner

Should an error be emitted here?

@dduponchel
dduponchel Aug 6, 2015 Collaborator

This method is called when something goes wrong in the ZipFileWorker. I don't remember why I don't just call end() (or why I don't propagate the end event upward), I'll investigate.

@dduponchel
dduponchel Sep 2, 2015 Collaborator

Done: the errors are now propagated upward and downward to close every workers.

@Stuk Stuk and 1 other commented on an outdated diff Aug 5, 2015
var support = require('./support');
-var compressions = require('./compressions');
-var nodeBuffer = require('./nodeBuffer');
+var nodejsUtils = require('./nodejsUtils');
+
+// "somewhere early in your app; it attaches to the global."
+// Is updating the global scope really a good idea ?
+require("setimmediate2");
@Stuk
Stuk Aug 5, 2015 Owner

There's also asap which is a simple function, and is probably better than updating a global

@dduponchel
dduponchel Aug 6, 2015 Collaborator

Sounds good.

@dduponchel
dduponchel Sep 2, 2015 Collaborator

Done.

dduponchel added some commits Aug 7, 2015
@dduponchel dduponchel Extract ZipFileWorker in its own file.
Split the generate.js file into generate/index.js and
generate/ZipFileWorker.js.
b84be38
@dduponchel dduponchel Workers: propagate errors upward with ".error()".
The error event wasn't sent upward, the parent workers were left paused
(or running).
Also make ".pause()" and ".resume()" return boolean, extending the
default behavior is easier with it.
058d391
@dduponchel dduponchel Replace setimmediate2 with asap.
The module `asap` doesn't update the global scope. The only downside: we
can't cancel the async call anymore. Instead, we now let the async call
happen and return early if the stream is paused or finished.
c286722
@Stuk
Owner
Stuk commented Sep 12, 2015

👍 is this ready to me merged then?

@dduponchel
Collaborator

@Stuk this pull request is ready, I'll continue on #224.

@Stuk Stuk merged commit c1260c8 into Stuk:jszip_v3 Sep 13, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@bitinn bitinn referenced this pull request in bitinn/node-fetch Oct 18, 2015
Closed

Use case for res.arrayBuffer #51

@Krisa
Krisa commented Nov 13, 2015

damned, this is amazing, seriously. v2.5 just failed and crashed after 1 minute or so on few hundreds files weighting between 10KB to 500KB each and this new version just did the whole job in half a second! Unless there are some bugs (which I have not found on my few basic tests), it would make sense it becomes the new official version.

Small info: grunt (which I had to use to produce the dist v3.0) is somehow outdated with its uglify dependency and throws an error. Updating uglify to the latest version simply fixed the problem.

@dduponchel dduponchel referenced this pull request Feb 10, 2016
Closed

Async support #121

@dduponchel dduponchel deleted the dduponchel:async branch Mar 24, 2016
@dduponchel dduponchel referenced this pull request Apr 12, 2016
Merged

Release 3.0.0 #278

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment