Skip to content

Config multipart chunk size in s3 etag and checksumming buffered reader#2

Merged
mweiden merged 2 commits intomasterfrom
mweiden-dcp-cli-issue149
Aug 2, 2018
Merged

Config multipart chunk size in s3 etag and checksumming buffered reader#2
mweiden merged 2 commits intomasterfrom
mweiden-dcp-cli-issue149

Conversation

@mweiden
Copy link
Copy Markdown
Contributor

@mweiden mweiden commented Aug 1, 2018

@mweiden mweiden requested review from sampierson and ttung August 1, 2018 16:29
@mweiden mweiden force-pushed the mweiden-dcp-cli-issue149 branch 2 times, most recently from b7ea160 to 1b30f0a Compare August 1, 2018 17:23
@codecov-io
Copy link
Copy Markdown

codecov-io commented Aug 1, 2018

Codecov Report

Merging #2 into master will increase coverage by 0.04%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master       #2      +/-   ##
==========================================
+ Coverage   92.69%   92.73%   +0.04%     
==========================================
  Files          10       10              
  Lines         301      303       +2     
==========================================
+ Hits          279      281       +2     
  Misses         22       22
Impacted Files Coverage Δ
dcplib/checksumming_io/checksumming_sink.py 92.3% <100%> (ø) ⬆️
...ib/checksumming_io/checksumming_buffered_reader.py 91.3% <100%> (ø) ⬆️
dcplib/checksumming_io/s3_etag.py 67.85% <71.42%> (+2.47%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0dbb0ff...6554c47. Read the comment docs.

@mweiden mweiden changed the title Config multipart chunk size in s3 etag and checksumming buffered reader [WIP] Config multipart chunk size in s3 etag and checksumming buffered reader Aug 1, 2018
@mweiden mweiden force-pushed the mweiden-dcp-cli-issue149 branch 2 times, most recently from 9774de6 to bc20de4 Compare August 1, 2018 18:07
@mweiden mweiden changed the title [WIP] Config multipart chunk size in s3 etag and checksumming buffered reader Config multipart chunk size in s3 etag and checksumming buffered reader Aug 1, 2018
@mweiden mweiden force-pushed the mweiden-dcp-cli-issue149 branch from bc20de4 to 2bfd956 Compare August 1, 2018 18:51

class ChecksummingBufferedReader:

def __init__(self, *args, **kwargs):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just make read_file_size a formal parameter?


def __init__(self, *args, **kwargs):
"""
:param read_file_size: optional file size for correctly setting the s3 etag chunk size
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if this is optional, what you probably want to do is to track the total number of bytes sent to ChecksummingBufferedReader. if it would have warranted a larger chunk size, it should throw an exception.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what case would it merit a larger chunk size?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end, if the chunk size is unspecified, you could call get_s3_multipart_chunk_size(..) with the number of bytes that got processed. If that is not S3Etag.default_chunk_size, then the checksum calculated is bad.

Or you could just make the parameter required. :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making it required. This might be a pain in some cases, but it will make it easier to understand.

@mweiden mweiden force-pushed the mweiden-dcp-cli-issue149 branch 3 times, most recently from 12c6ea2 to 57f346e Compare August 1, 2018 21:59
def __init__(self, *args, **kwargs):
def __init__(self, file_handler, read_chunk_size, *args, **kwargs):
"""
:param read_file_size: optional file size for correctly setting the s3 etag chunk size;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove optional.

Copy link
Copy Markdown
Contributor Author

@mweiden mweiden Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch ✅

"""
:param read_file_size: optional file size for correctly setting the s3 etag chunk size;
defaults to S3Etag default if None
:param file_handler: the file handler to read from
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reorder this above the docblock for read_file_size

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread dcplib/checksumming_io/s3_etag.py Outdated
def __init__(self):
default_chunk_size = 64 * 1024 * 1024

def __init__(self, chunk_size=None):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pls make chunk_size mandatory.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mweiden mweiden force-pushed the mweiden-dcp-cli-issue149 branch from 57f346e to 5dfba2e Compare August 1, 2018 22:59
Copy link
Copy Markdown
Member

@ttung ttung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small fixes to s3_etag pls, but otherwise lgtm.

Comment thread dcplib/checksumming_io/s3_etag.py Outdated
etag_stride = 64 * 1024 * 1024

def __init__(self):
default_chunk_size = 64 * 1024 * 1024
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not necessary any more, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh. Yes. ✅

Comment thread dcplib/checksumming_io/s3_etag.py Outdated
self._etag_bytes = 0
self._etag_parts = []
self._etag_hasher = hashlib.md5()
self._chunk_size = chunk_size or self.default_chunk_size
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should just be self._chuck_size = chunk_size

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mweiden mweiden force-pushed the mweiden-dcp-cli-issue149 branch from 5dfba2e to 6554c47 Compare August 2, 2018 15:04
@mweiden mweiden merged commit 574640c into master Aug 2, 2018
@mweiden mweiden deleted the mweiden-dcp-cli-issue149 branch August 2, 2018 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants