feat(datasets): allow multipart uploads for large datasets #384
Conversation
mostly not awful, just need to navigate some lingering query param errors:
ref. https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html
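For reference, a minimal boto3 sketch of the presigned-part-URL shape under discussion (the bucket and key are made-up placeholders, not this PR's code). The `UploadId` and `PartNumber` ride along as SigV4 query string parameters, which is where the lingering query param errors mentioned above tend to show up:

```python
import boto3

s3 = boto3.client("s3")

# Start a multipart upload; the placeholder bucket/key are illustrative only.
mpu = s3.create_multipart_upload(Bucket="example-bucket", Key="datasets/archive.zip")

# Presign a PUT for part 1; UploadId and PartNumber end up in the query string
# and are covered by the SigV4 signature (see the linked AWS doc).
part_url = s3.generate_presigned_url(
    ClientMethod="upload_part",
    Params={
        "Bucket": "example-bucket",
        "Key": "datasets/archive.zip",
        "UploadId": mpu["UploadId"],
        "PartNumber": 1,
    },
    ExpiresIn=3600,
)
```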
Force-pushed from 610c066 to 5481ffe
@@ -557,24 +560,114 @@ def update_status():

class PutDatasetFilesCommand(BaseDatasetFilesCommand):

    @classmethod
    def _put(cls, path, url, content_type):
    # @classmethod
I don't think this breaks anything; at least, I haven't been able to break it.
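For context, a plain-function `_put` along these lines would be a minimal sketch of what this step needs (assuming the `requests` library; the body is illustrative, not the PR's actual implementation):

```python
import requests

def _put(path, url, content_type):
    # Stream the file to the presigned URL. Nothing here depends on cls/self,
    # which is presumably why dropping @classmethod doesn't break anything.
    with open(path, "rb") as f:
        response = requests.put(url, data=f, headers={"Content-Type": content_type})
    response.raise_for_status()
    return response
```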
Force-pushed from 5481ffe to 8de0ab6
@@ -599,8 +692,13 @@ def _sign_and_put(self, dataset_version_id, pool, results, update_status):

    for pre_signed, result in zip(pre_signeds, results):
        update_status()
        pool.put(self._put, url=pre_signed.url,
                 path=result['path'], content_type=result['mimetype'])
        pool.put(
Granted, this isn't really ideal. We're single-threading all parts of an upload in a single worker rather than distributing all N parts among all M workers in the pool. This will result in longer upload times, but that's better than the broken upload we have today. Soooo baby steps.
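To illustrate the trade-off (with a generic thread pool, not this PR's worker pool): submitting one task per file means that file's parts upload back-to-back inside a single worker, whereas submitting one task per part would spread the N parts across the M workers. The names below are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_part(part_url, chunk):
    """Placeholder: PUT one chunk to its presigned part URL."""
    ...

def upload_file(parts):
    # Current shape: all parts of one file run sequentially in one worker.
    for part_url, chunk in parts:
        upload_part(part_url, chunk)

parts = [("https://example.com/part1", b"..."), ("https://example.com/part2", b"...")]

with ThreadPoolExecutor(max_workers=4) as pool:
    pool.submit(upload_file, parts)            # today: serial per file
    # Later improvement: one task per part, so all N parts share the M workers.
    # for part_url, chunk in parts:
    #     pool.submit(upload_part, part_url, chunk)
```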
# less than the part_minsize, AND we want to 1-index
# our range to match what AWS expects for part
# numbers
for part in range(1, (size // part_minsize) + 2):
this'll also add an extra empty part if the upload is exactly divisible by 500MB, which will probably cause an error from AWS due to it being too small. But also 🤷
This cost me my day. Use `ceil`.
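A quick sketch of the off-by-one, assuming the 500MB `part_minsize` from the hunk above:

```python
import math

part_minsize = 500 * 1024 * 1024  # 500MB, as discussed above

size = part_minsize  # an upload exactly divisible by the part size

# The floor-based range yields two part numbers here, the second one empty:
assert list(range(1, (size // part_minsize) + 2)) == [1, 2]

# ceil gives the right count (still 1-indexed, as AWS expects for part numbers):
num_parts = max(1, math.ceil(size / part_minsize))
assert list(range(1, num_parts + 1)) == [1]
```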
...
This attempts to fall back to a multipart upload strategy with presigned URLs in the event that a dataset is larger than 500MB
Force-pushed from 8de0ab6 to 5b2a78c
LGTM, we can test multi-threading later haha.
🎉 This PR is included in version 1.11.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
This attempts to fall back to a multipart upload strategy with presigned URLs in the event that a dataset is larger than 500MB.
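Stepping back, the fallback described here generally follows the standard S3 multipart flow. A hedged boto3/requests sketch of that flow, where the bucket, key, and 500MB threshold are placeholders mirroring the PR description rather than its actual code:

```python
import math
import boto3
import requests

PART_SIZE = 500 * 1024 * 1024  # threshold/part size from the PR description

def upload_dataset(path, size, bucket, key):
    s3 = boto3.client("s3")

    if size <= PART_SIZE:
        # Small datasets keep a single presigned PUT.
        url = s3.generate_presigned_url(
            "put_object", Params={"Bucket": bucket, "Key": key}
        )
        with open(path, "rb") as f:
            requests.put(url, data=f).raise_for_status()
        return

    # Fallback: multipart upload with one presigned URL per part.
    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    with open(path, "rb") as f:
        for part_number in range(1, math.ceil(size / PART_SIZE) + 1):
            url = s3.generate_presigned_url(
                "upload_part",
                Params={"Bucket": bucket, "Key": key,
                        "UploadId": mpu["UploadId"], "PartNumber": part_number},
            )
            resp = requests.put(url, data=f.read(PART_SIZE))
            resp.raise_for_status()
            parts.append({"PartNumber": part_number, "ETag": resp.headers["ETag"]})

    # All parts uploaded; tell S3 to assemble the object.
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": parts},
    )
```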