feat(datasets): allow multipart uploads for large datasets #384

cwetherill-ps
Contributor

This attempts to fall back to a multipart upload strategy with presigned
URLs in the event that a dataset is larger than 500MB

@cwetherill-ps
Contributor Author

cwetherill-ps commented Apr 7, 2022

mostly not awful, just need to navigate some lingering query param errors:

<Error>
<Code>AuthorizationQueryParametersError</Code>
<Message>
Query-string authentication version 4 requires the X-Amz-Algorithm, X-Amz-Credential, X-Amz-Signature, X-Amz-Date, X-Amz-SignedHeaders, and X-Amz-Expires parameters.
</Message>
<Key>
te6x30gzr/datasets/dswrkyj0ymtibue/versions/glkpp4r/data/asdf.csv
</Key>
<BucketName>bucket</BucketName>
<Resource>
/bucket/te6x30gzr/datasets/dswrkyj0ymtibue/versions/glkpp4r/data/asdf.csv
</Resource>
<RequestId>16E3B91D7F4FEF72</RequestId>
<HostId>669e732c-41ff-4697-98ec-4a8644d1e8f5</HostId>
</Error>

ref. https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html
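The error message above names six query parameters that SigV4 query-string authentication requires. A minimal sketch for checking a presigned URL for them before sending the request; the helper name `missing_sigv4_params` is hypothetical and not part of this PR:

```python
from urllib.parse import urlsplit, parse_qs

# The six parameters named in the AuthorizationQueryParametersError message.
REQUIRED_SIGV4_PARAMS = (
    "X-Amz-Algorithm",
    "X-Amz-Credential",
    "X-Amz-Signature",
    "X-Amz-Date",
    "X-Amz-SignedHeaders",
    "X-Amz-Expires",
)


def missing_sigv4_params(url):
    """Return the required SigV4 query parameters that are absent from `url`."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return [name for name in REQUIRED_SIGV4_PARAMS if name not in query]
```

An empty list only means the parameters are present; it says nothing about whether the signature itself is valid.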

@cwetherill-ps cwetherill-ps force-pushed the cwetherill/nb-917-bug-uploading-large-datasets-from-cli branch from 610c066 to 5481ffe Compare April 8, 2022 16:28
@cwetherill-ps cwetherill-ps marked this pull request as ready for review April 8, 2022 16:28
@@ -557,24 +560,114 @@ def update_status():

class PutDatasetFilesCommand(BaseDatasetFilesCommand):

@classmethod
def _put(cls, path, url, content_type):
# @classmethod
Contributor Author


I don't think this breaks anything; at least, I haven't been able to make it break.

@cwetherill-ps cwetherill-ps force-pushed the cwetherill/nb-917-bug-uploading-large-datasets-from-cli branch from 5481ffe to 8de0ab6 Compare April 8, 2022 16:31
@@ -599,8 +692,13 @@ def _sign_and_put(self, dataset_version_id, pool, results, update_status):

for pre_signed, result in zip(pre_signeds, results):
update_status()
pool.put(self._put, url=pre_signed.url,
path=result['path'], content_type=result['mimetype'])
pool.put(
Contributor Author


Granted, this isn't really ideal. We're single-threading all parts of an upload in a single worker rather than distributing all N parts among all M workers in the pool. This will result in longer upload times, but that's better than the broken upload we have today. Soooo baby steps.
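A rough sketch of the alternative mentioned above, where the N parts of one upload are spread across M workers. It uses the standard library's `ThreadPoolExecutor` rather than this project's own pool, and `upload_part` is a hypothetical callable (one part number in, that part's ETag out), so treat it as an illustration and not as the code in this PR:

```python
from concurrent.futures import ThreadPoolExecutor


def upload_parts_concurrently(part_numbers, upload_part, max_workers=4):
    """Run upload_part(n) for every part number across a thread pool.

    `upload_part` is a hypothetical callable that uploads one part and
    returns its ETag. The result is ordered by part number.
    """
    part_numbers = list(part_numbers)  # allow ranges and generators
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, even though the
        # calls themselves may finish out of order.
        etags = list(executor.map(upload_part, part_numbers))
    # Shape mirrors the part list S3 expects when completing an upload.
    return [
        {"PartNumber": number, "ETag": etag}
        for number, etag in zip(part_numbers, etags)
    ]
```

Any exception raised by `upload_part` propagates out of `executor.map`, so a failed part fails the whole call instead of being silently skipped.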

# less than the part_minsize, AND we want to 1-index
# our range to match what AWS expects for part
# numbers
for part in range(1, (size // part_minsize) + 2):
Contributor Author


this'll also add an extra empty part if the upload is exactly divisible by 500MB, which will probably cause an error from AWS due to it being too small. But also 🤷

Contributor


This cost me my day. Use ceil...
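For reference, a small sketch of the ceiling-based part count suggested here. The function name `part_numbers` and the `part_size` parameter are illustrative, not names from this PR; integer arithmetic is used so the result stays exact for very large files:

```python
def part_numbers(size, part_size):
    """Return the 1-indexed part numbers needed to cover `size` bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    # Integer ceiling division: equal to ceil(size / part_size), so an
    # exact multiple of part_size does not produce a trailing empty part.
    count = (size + part_size - 1) // part_size
    return range(1, count + 1)
```

With a 500MB part size, a file of exactly 500MB yields the single part `1`, whereas `range(1, (size // part_minsize) + 2)` yields parts `1` and `2`, the second of which would be empty.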

@cwetherill-ps cwetherill-ps force-pushed the cwetherill/nb-917-bug-uploading-large-datasets-from-cli branch from 8de0ab6 to 5b2a78c Compare April 8, 2022 17:38
@marquiswashere marquiswashere left a comment


LGTM, we can test multi-threading later haha.

@cwetherill-ps cwetherill-ps merged commit 53d4c87 into master Apr 8, 2022
@cwetherill-ps cwetherill-ps deleted the cwetherill/nb-917-bug-uploading-large-datasets-from-cli branch April 8, 2022 18:26
@PSBOT

PSBOT commented Apr 8, 2022

🎉 This PR is included in version 1.11.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

4 participants