
Compress serialized config data with zlib #4465

Open · wants to merge 1 commit into base: main

Conversation

jacobtomlinson
Member

While working on dask-cloudprovider I've noticed there is a limit to the amount of user_data you can pass to AWS. Among other things, a serialized copy of the local Dask config is passed via the user_data, and depending on how much config a user has, the AWS API can reject the call.

dask/dask-cloudprovider#249

In an attempt to mitigate this I've added zlib compression to the config serialization utility methods. From the limited testing I've done locally I've seen a ~60% reduction in size.

>>> from distributed.utils import serialize_for_cli
>>> import dask.config
>>> len(serialize_for_cli(dask.config.global_config))  # Before change
5452
>>> len(serialize_for_cli(dask.config.global_config))  # After change
2256
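
For reference, this is roughly what the round trip looks like with compression added. Treat it as a sketch rather than the exact diff; the yaml/base64 details here are illustrative:

import base64
import zlib

import yaml


def serialize_for_cli(data):
    # dump the config to yaml, compress with zlib, then base64-encode
    # so the result is safe to embed in user_data / CLI arguments
    return base64.urlsafe_b64encode(zlib.compress(yaml.safe_dump(data).encode())).decode()


def deserialize_for_cli(data):
    # reverse the steps: base64-decode, decompress, yaml-load
    return yaml.safe_load(zlib.decompress(base64.urlsafe_b64decode(data.encode())).decode())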

@quasiben
Member

Thanks @jacobtomlinson -- do you think we need to handle cases where zlib is not installed?

@jacobtomlinson
Member Author

@quasiben I was under the impression that zlib was in the standard library?

@quasiben
Member

@jacobtomlinson , apologies. It is in the standard library!

@mrocklin
Member

mrocklin commented Jan 28, 2021 via email

@jacobtomlinson
Member Author

I agree there is probably a better way to pass this config along. This PR is just treating a symptom.

Passing a file as part of a VM creation on any cloud is non-trivial though. I could imagine passing it through some storage like S3, but that adds complexity.
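
For illustration, something like this could work on AWS (the bucket/key names and the presigned-URL approach are assumptions on my part, not anything in this PR):

import boto3
import yaml

import dask.config

s3 = boto3.client("s3")
bucket, key = "my-dask-bootstrap-bucket", "cluster-config.yaml"  # hypothetical names

# upload the local Dask config instead of inlining it in user_data
s3.put_object(Bucket=bucket, Key=key, Body=yaml.safe_dump(dask.config.global_config).encode())

# pass only a short, expiring URL through user_data; the VM would then
# fetch the config at boot time instead of receiving it inline
url = s3.generate_presigned_url(
    "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
)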

@ZirconCode

ZirconCode commented Feb 1, 2021

Coming from here (thanks @quasiben), I can confirm that this PR solved my issue; at least on this side of things there is no more error. However, my machine is stuck at Waiting for scheduler to run at ip:port, while on AWS it is "shutting down" / terminated. I believe I also have to pull this change into my packer build, but I'm somewhat at a loss as to how to do this.

edit: So I did a bad thing; I launched the AMI created by packer, edited all the /distributed/utils.py files in the conda envs to match the merge changes, and created a new AMI from this image. Unfortunately, I still get the same behavior as above.

@jacobtomlinson
Member Author

@ZirconCode glad this PR got you to the next error. Could you raise an issue on dask-cloudprovider about this?

Base automatically changed from master to main March 8, 2021 19:04
@martindurant
Member

I was reminded of this issue by https://stackoverflow.com/questions/75815578/starting-ec2cluster-with-dask-cloudprovider . I can't think of any reason NOT to do this, even though one would think crypto data is not very compressible.
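
For what it's worth, a quick throwaway check of that intuition (nothing from the PR itself):

import os
import zlib

# repetitive yaml-style config text compresses very well...
text = ("distributed:\n  scheduler:\n    allowed-failures: 3\n" * 100).encode()
print(len(zlib.compress(text)) / len(text))  # tiny ratio

# ...while random (encrypted-looking) bytes barely shrink, and can even grow
noise = os.urandom(len(text))
print(len(zlib.compress(noise)) / len(noise))  # ~1.0 or slightly above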
