Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need sync compression, not just async #281

Open
ericvergnaud opened this issue Apr 21, 2016 · 30 comments
Open

Need sync compression, not just async #281

ericvergnaud opened this issue Apr 21, 2016 · 30 comments
Labels

Comments

@ericvergnaud
Copy link

Hi,
thanks for this impressive work.
I'm struggling to use jszip in a scenario which requires synchronous execution.
While async calls are great for probably most use cases, they make it impossible to have an in-memory zipped piece of data (in the form of an ArrayBuffer) as a value.
There are many use cases where this is required, such as returning a buffer. This is simply impossible to implement using async functions.
Example:

var datas = ["some data...", "other data...", "etc..."];
var zipped  = zipDatas(datas); // <- this is where we need synchronous execution
doSomethingWithZipped(zipped);

I notice that JSZip.generate was removed in recent releases. Would it be possible to bring it back (together with a strong warning that this is not the preferred approach)?

Eric

@ericvergnaud
Copy link
Author

Hi,
actually I'm able to get close to what I need by using a custom Promise (which executes synchronously).
However, it seems that jszip also relies on 'asap' to postpose various computations.
Would it be possible to make 'asap' customizable the same way Promise is, so it can be replaced by whatever is appropriate?
Happy to provide a PR if you are fine with the idea.
Eric
n.b. I have the same issue with 'loadAsync'

@dduponchel
Copy link
Collaborator

sync vs async: I tried (see the comment of 0ceb14c) but eventually removed the sync methods because it became increasingly hard to make everything work with async input and async output.

asap in external: A Promise is an ES 6 object and we need a polyfill for older browsers. We let users override it because it directly affect the type of the output of JSZip: if someone uses an other Promise implementation (Bluebird for example) or an other Promise polyfill that is incompatible with the one we use, JSZip.external.Promise let him/her fix that.

asap is used internally to defer code execution (in utils.delay). Exposing it in JSZip.external would tie JSZip's internals to this lib (overriding the utils.delay method would solve this issue) but more importantly I strongly suspect that it will be a source of bugs.
The *async methods are designed to be asynchronous, I think you will get hard-to-debug issues with workers' states.

Instead of letting people release Zalgo and make *async methods synchronous, I'd rather look for a robust way to support both sync and async methods. In the mean time, JSZip v2.6.0 still work synchronously.

@ericvergnaud
Copy link
Author

I'll come up with a proposal soon.

@ericvergnaud
Copy link
Author

Hi,
I've submitted a PR which follows the simple observation that asynchronous calls being bound to execute correctly at any point in the future (which includes immediately), they can therefore conceptually be safely executed immediately without breaking any paradigm.
I have not found any bugs for the simple use cases I need (zip a bunch of buffers, or unzip them).
There is indeed the possibility that some bugs may appear when using the new 'sync' wrapper. I'm under the impression that those bugs would also appear anyway with a Promise implementation that does not wait for the end of the current event handling to execute promises.
In that sense, I think that discovering those bugs would be beneficial since they are presumably much easier to fix in a synchronous context than in a async one.
Eric

@graphicore
Copy link

Hi, I want to add some arguments for adding a synchronous mode. because I believe there are more than enough good reasons to have a synchronous mode available. Just stating that

it became increasingly hard to make everything work with async input and async output.

is no good reasoning.

From my perspective there are good use cases where an async model is counter productive or even bad, if not an anti-pattern:

  • In HTML5 Web Workers its totally OK to have long running sync calls.
  • A command line interface (with NodeJS) often does not profit at all from running stuff asynchronously.
  • Adhoc scripting/testing in a repl is so much easier if you can use sync interfaces.
  • When bootstrapping an application it may just be unresponsive waiting time for the user, no matter if sync or async.
  • Some (fringe) cases may make it harder to use async interfaces, like cross language communication (I'm just struggling with a webkit/cocoa bridge thing where it would be just nice not to need a callback in cocoa, instead an immediate return.)

Concerning the performance, async execution is always worse than straight sync execution, simply because all of the overhead the async model requires (making Promises, scheduling, creating and calling callbacks). So, especially the WebWorker and Commandline Interface cases are very good reasons to have a sync execution path.

When it is totally OK, from an application perspective to run a task synchronously and just to wait for its result in place, it is bad to force asynchronous execution. It adds a lot of boilerplate and it changes and complicates the code design, that's when I call it an anti-pattern.
As a library that should be thought of being infrastructure I don't think that jszip should be forcing people into making worse code.

I made a couple of APIs that support both async and sync execution. I build a helper for it (https://github.com/graphicore/obtainJS) which at least helps me organizing the similar and distinct pieces of a async/sync function. I don't want to get you into using it, especially because I think it needs a good amount of overhaul and some more tools to make it broader useful, also it's probably not the fastest possible implementation. However, I'm saying that it is possible without making your project unmaintainable and that its worth the effort.
If you are interested in how I approached this kind of problem, I'd be happy to show you around in some code with mixed API. Maybe, I also could take this as an incentive to make some updates to obtainJS and make it feasible for jszip.

@Rouche
Copy link

Rouche commented Jun 2, 2016

Hello,

I have the exact same scenario as the OP.

The scenario i have is a bit sticky but i need to compress data and store it in memory Synchronously.

@puzrin
Copy link
Contributor

puzrin commented Jun 17, 2016

IMHO sync methods are useful in node only (but i agree those can be useful there). It worth to split entry points for node and for browserified builds (browser property in package.json). That will help to automatically assemble different APIs on top level.

@Rouche
Copy link

Rouche commented Jun 17, 2016

Not only in node. Were on the client side in Angular.

Edit some clarification of the use case:
Compression can be done Async since we already are in a promise chain. We keep the object in storage.
But getters on this object need to be Sync, we cannot transform all the application with async calls everywhere.
Its small enough to not feel browser freeze.

Keeping the unzipped data in memory can end up with data duplication nightmares. So we just keep data in uncompressed state for now.

@graphicore
Copy link

IMHO sync methods are useful in node only

@puzrin: Please explain. We're working here to give good reasons for a sync way to do zip stuff, also in browsers. I'd like to know if you have arguments to make these reasons void or if you just don't care about sync execution in browsers.

@ericvergnaud
Copy link
Author

Sync methods are also useful in browsers in the following situations:

  • the code is executing in a Worker
  • the business logic is such that async code introduces unmanageable complexity while the data being zipped/unzipped is known to be small, so timeouts are not an issue

Le 17 juin 2016 à 09:23, Lasse Fister notifications@github.com a écrit :

IMHO sync methods are useful in node only

@puzrin https://github.com/puzrin: Please explain. We're working here to give good reasons for a sync way to do zip stuff, also in browsers. I'd like to know if you have arguments to make these reasons void or if you just don't care about sync execution in browsers.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub #281 (comment), or mute the thread https://github.com/notifications/unsubscribe/ADLYJAzQ0VndnJfukYhLUZZYlKYdSalTks5qMfb4gaJpZM4IMWZd.

@puzrin
Copy link
Contributor

puzrin commented Jun 17, 2016

@graphicore may be i'm not familiar with some use cases. The requirements of async/sync interfaces is determined by external components of data flow process. If browser has only async interfaces to load/store data, it would be strange to use sync interface under the hood.

@ericvergnaud

  • the code is executing in a Worker

Worker is async (post messaging) by design. You will have to pass/receive data in async way with no choice. So, sync jszip is not strict requirement here.

  • the business logic is such that async code introduces unmanageable complexity while the data being zipped/unzipped is known to be small, so timeouts are not an issue

I agree in general. But final choice depends on all stages of data flow in your system. If the most of data blocks are async, and you try to cheat in single block with sync code, that can be a bad design. Problem with browsers is that data sources are async there (usually, callbacks or Promise).


Well, anyway, i don't intended to troll anyone about "sync vs async" and so on. My point is, that it's technically difficult/impossible to combine such things in one bottle. Working solution could be to split top level API to separate files, because some parts (encoding, for example) are implemented completely different and some features may be not needed.

The reason why i mentioned sync for node.js is because i know obvious use case for it - writing simple shell scripts.

@ericvergnaud
Copy link
Author

@puzrin
Interacting with a Worker is indeed async, but running code within the Worker isn't. So if you need to zip/unzip data within Worker code, you're in trouble.

@puzrin
Copy link
Contributor

puzrin commented Jun 17, 2016

I see at least 2 alternatives of workers use with jszip, modular and without requirement to sync interface:

  1. Outsize. Wrapper around public interface
  2. Insize. Proxy for inflate/deflate/crc.

Did not analyzed in details.

IMHO in api design process, logic goes via such stages:

  1. Collect business requirements (use cases) with mandatory proofs. Examples:
    • "i need to write simple linear shell scripts" - good.
    • "i need sync interface" - not enougth.
    • "i need to avoid browser interface freeze" - good
  2. Try to reduce use cases from (1) to lower level. Example:
    • need Blob support as data source, actual for browsers (modern only?)
    • need binary strings support, actual for ancient browsers if we can't suggest better alternatives via polyfills. Queue reiteration of (1) to clarify use cases.
  3. Recombine result of (2) to non-conflicting groups of API requirements. As @dduponchel said, it's very hard to make everything work via single entry point. I can confirm that from my experience. Example of such approach is here https://github.com/nodeca/probe-image-size. Async API provides guarantee to be effective with any file sizes. Sync API is available for sotuations when tradeoffs about file size and memory are acceptable (depending on use case, that can be or can not be a bad design)

@graphicore
Copy link

The requirements of async/sync interfaces is determined by external components of data flow process.

There's no inherent async external dependency in this thing.

Problem with browsers is that data sources are async there (usually, callbacks or Promise).

That's wrong. I/O is usually better done async, like XMLHttpRequest or newer interfaces are often async only. But, look at a common example for unzipping: Using 'changeHandler' for a <input type="file" />
you can access the file data right on callback. There's no async step involved. And, after I have loaded, let's say a zip-file, asynchronously via a XMLHttpRequest it's there and I can use it. There's no more async needed.
Also, the data source to go into a zip can be my application state, i.e. from memory. Maybe I'm just adding one file at a time and can live well with the time it takes.

The problem here is not that we want to code around an inherently async interface. The problem is a long running thread, and if you ask me, forcing people into this kind of execution is bad design. And we know that it is forced, because just a version ago it ran perfectly fine as a sync call.

Talking about this, I think the way the asynchronicity is created here is not by running any async code, but by chaining sync code via Promises. The promises themselves behave async when they are resolved. That sounds to me like a rather simple thing to turn into sync code again.

"i need sync interface" - not enougth.

Just by neglecting it your "analysis" wont get any more true. There are many good cases where a much simpler sync call will suffice, it is enough to realize this. The other way around is forcing us users into a paradigm that is not justified.

it's very hard to make everything work via single entry point.

I did it often. It's not that hard. You just have to organize your code in the right bite-size. In this case, the whole process could a big generator internally.

The sync case is executed like this:

function syncRunner(generator) {
      var result;
      while(!(result = generator.next()).done);`
      return result.value;
}

To make it async, if you want to do it with promises, call the generators next method in your then method. The Promises wil relove asynchronously.The async case is executed something like this:

function asyncRunner(generator) {
    // note that in this case the first call to generator.next() is executed when asyncRunner
    // is initally called. That's usually not a good thing for async code, because of error handling.
    // I'd make an initial delay from the initial caller, so this implemantation can stay
    // simpler. Or there is something simple we could do in here.
    var result = generator.next();
    if(result.done)
        return Promise.resolve(result.value);
    return Promise.resolve(generator).then(asyncRunner);
}

// run like asyncRunner(myGenerator()).then(onResult);

http://jsbin.com/mefevivuxi/1/edit?js,console

With this architecture, based on generators, we can even use the yield statements to transport progress information, i.e. to update a progress bar or such. This would be a nice feature for the library, without bloating it. It basically comes for free.

Also, the library user could be exposed to the generator directly and fully control how and when delays are needed.

The runAsync, runSync interfaces would be just convenience methods as implemented above.

Note that this is that easy because we have no real async interfaces in the pipeline.

Async API provides guarantee to be effective with any file sizes. Sync API is available for sotuations when tradeoffs about file size and memory are acceptable (depending on use case, that can be or can not be a bad design)

Trade-offs are OK. Nobody expects that we live in a perfect world. Paternalism is not OK.

@puzrin
Copy link
Contributor

puzrin commented Jun 17, 2016

But, look at a common example for unzipping: Using 'changeHandler' for a <input type="file" />
you can access the file data right on callback. There's no async step involved.

AFAIK, you have no sync file access there. Async read via FileReader still required. Also that means a modern browser with typed arrays support (no need for some boring fallbacks like binary strings).

And, after I have loaded, let's say a zip-file, asynchronously via a XMLHttpRequest it's there and I can use it. There's no more async needed.

Yes, that's a often use case. But since XHR* is event-driven, it doesn't looks a problem to extract data in async way and rethrow event. I don't mean that sync method is not needed at all. I mean, in this case it does not seems to add principal difference.

Talking about this, I think the way the asynchronicity is created here is not by running any async code, but by chaining sync code via Promises. The promises themselves behave async when they are resolved. That sounds to me like a rather simple thing to turn into sync code again.

Promises were added not for fun, but because some internal steps like File read in memory effective way are async. I quess, choice was to reject new features or reject sync api as it was in previous version.

The problem here is not that we want to code around an inherently async interface. The problem is a long running thread, and if you ask me, forcing people into this kind of execution is bad design. And we know that it is forced, because just a version ago it ran perfectly fine as a sync call.

That's normal. Because async API was required by new features like huge files streamed processing and not by desire to teach people programming. Sync methods need noticeable work to rewrite top level components "gluer", because it's no combineable with async one. But as everywhere in OSS, nobody prohibits to stay with previous version or do PR :). As far as i see, this issue is still not rejected.

@graphicore
Copy link

AFAIK, you have no sync file access there.

Yeah, you're right. But once you got the content that's over.

Promises were added not for fun, but because some internal steps like File read in memory effective way are async. I quess, choice was to reject new features or reject sync api as it was in previous version.

it was sync before this release, stop speculating please. They where added for no other reason than to make the long process not trigger a "the script it running forever" alert.

Sync methods need noticeable work to rewrite top level components "gluer", because it's no combineable with async one.

The other way around is true. You can't make a sync call from a real async one, but it's easy to make a sync call execute in an async manner, i.e. setTimeout.

do PR

Yeah, I might. But not without consent of the maintainer. Otherwise, without consent, I'd rather fork it an do my own implementation.

@puzrin
Copy link
Contributor

puzrin commented Jun 17, 2016

I don't specilate, i explain what needs of 'sync' API means, as i understant it. It's technically impossible to move forward with big files support without async api. And thechnicaly impossible to provide both with zero efforts, because those have significat difference internally. May be authors have human resources for one task only, but that's not due wish to make your life difficult :)

@graphicore
Copy link

graphicore commented Jun 17, 2016

It's technically impossible to move forward with big files support without async api.

So what calls are responsible for this? And why not let the user decide

both with zero efforts,

nobody wants that

May be authors have human resources for one task only

I repeatedly offered my help. I'm waiting for a dialogue about that.

update: I'm afk from now til sunday. I'll catch up later.

@puzrin
Copy link
Contributor

puzrin commented Jun 17, 2016

So what calls are responsible for this? And why not let the user decide

  • File chunked read (memory, but interface is async even for full read) + blob chunked write (memory).
  • Probably, webworkers support, but not sure, depends on architecture approach.

"Let user decide" = provide different APIs with different functionality. I said about it in the first post, and explained steps how to split methods in optimal way. That includes check, if any code can be shared at all. In example with image size check it was more effective to do 2 independed implementations instead of try to reuse code (those are still in single package for "let users decide" :).

I repeatedly offered my help. I'm waiting for a dialogue about that.

I'm not author of jszip and can't comment. I'm author of deflate/inflate used here, and my activity is usually related to bugs monitoring. We got a bug report for deflate, and i was checking if there are any related issues here. Also posted some replies if i had experience to share.

@jimmywarting
Copy link
Contributor

There you have it!

async function doMagic() {
  let blob = await zip.generateAsync({type: blob()})
  saveAs(blob, filename)
}

Wasn't that hard. There is a way for async code to fell and operate kind of like synchronous code but still doesn't block.
you can also use yield + co/corutine if babel/or browser don't cut it
there is also a grate async class that you can use

@graphicore
Copy link

Thanks @jimmywarting, but this is not about syntactic sugar.

@dduponchel
Copy link
Collaborator

Sorry for the lack of updates on this issue.

JSZip v3 isn't async "just" to avoid the alert "the script it running forever"
(even if it was a big part of it). The v3 also added support for streams, blob
and promises which can't be read synchronously (and that was a big win IMHO).
To do that, the internals of JSZip changed quite a bit (git says
2446 insertions, 1284 deletions on lib/) from "handle a big Uint8Array
in one pass" to "stream it". The async method just accumulates the result.

I guess, choice was to reject new features or reject sync api as it was in previous version.

You're right @puzrin :)

In this case, the whole process could a big generator internally.

@graphicore Generators look nice and I totally see the advantages when reading
something synchronously (Uint8Array for example) but I don't know how well
they mix with async reads (Blob) and streams.
Currently, the producer (either a stream or a function reading the content with
a setImmediate) choose the tempo, not the consumer. Changing this may impact
streams support (especially backpressure). I'm interested in feedbacks if
anyone mixed them !

Note that this is that easy because we have no real async interfaces in the pipeline.

We will with #290 (don't resolve the full blob but read chunks only when
needed). The FileReader needed to do that is asynchronous.

I'm working on a fix for #290 which made implementing sync methods a bit
easier. I created sync / loadSync methods at the top level and split
everything that couldn't support both into their sync/async versions. This is
basically what @puzrin described.
These new methods will have limitations: if the current zip object used an
async source (a blob, a promise, a stream: any source supported since the v3),
they will fail.
I currently have an (almost) working implementation which still need some code
cleanup, bug fixes and a serious work on the documentation to explain all
of this.

@jimmywarting
Copy link
Contributor

btw, you have the sync blob reader inside workers if that counts for something...

@graphicore
Copy link

These new methods will have limitations: if the current zip object used an
async source (a blob, a promise, a stream: any source supported since the v3),
they will fail.

Sounds good to me.

@ericvergnaud
Copy link
Author

Guys,
FYI I have published on npm a jszip-sync package which comprises #284 .
I appreciate this is not the preferred solution, but I happen to need it now, so I can use it in Travis CI.
I'll be happy to switch back to an official implementation once one becomes available.

@photopea
Copy link

So there is still no way to unzip data synchronously, even when I have everything loaded in memory?

@ericvergnaud
Copy link
Author

ericvergnaud commented Apr 27, 2017

There is jszip-sync on npm.

@photopea
Copy link

I wrote my own ZIP reader in 50 lines of code (it supports only Deflate compression through pako.js, but it is enough in my case).

@tillkruss
Copy link

@Stuk: Can we backport jszip-async into this repo?

@kenorb
Copy link

kenorb commented Aug 21, 2022

For sync methods, try fflate as mentioned in GH-753.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

9 participants