Skip to content

Add HTTP resume to DCP CLI download.#101

Merged
ttung merged 1 commit intomasterfrom
tonytung-retry
May 1, 2018
Merged

Add HTTP resume to DCP CLI download.#101
ttung merged 1 commit intomasterfrom
tonytung-retry

Conversation

@ttung
Copy link
Copy Markdown
Member

@ttung ttung commented Apr 2, 2018

This allows us to do ranged gets rather than restart the entire download.

Verify the resulting download for extra correctness.

Test plan: Start a download, yank the ethernet cable, plug it back in, and watch the download succeed.

@ttung
Copy link
Copy Markdown
Member Author

ttung commented Apr 2, 2018

Successful retry & resume.

[czipa1osx186 (.venv)]:~/hca/data-store-cli:tonytung-retry> hca dss download --bundle-uuid cc93131b-c616-4a4c-93dc-85436fc5f98e --replica aws
INFO:hca:File project.json: Retrieving...
INFO:hca:File project.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/project.json.
INFO:hca:File biomaterial.json: Retrieving...
INFO:hca:File biomaterial.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/biomaterial.json.
INFO:hca:File 22011_1#54_1.fastq.gz: Retrieving...
INFO:hca:File 22011_1#54_1.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_1.fastq.gz: Resuming at 9437184.
INFO:hca:File 22011_1#54_1.fastq.gz: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/22011_1#54_1.fastq.gz.
INFO:hca:File 22011_1#54_2.fastq.gz: Retrieving...
INFO:hca:File 22011_1#54_2.fastq.gz: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/22011_1#54_2.fastq.gz.
INFO:hca:File file.json: Retrieving...
INFO:hca:File file.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/file.json.
INFO:hca:File process.json: Retrieving...
INFO:hca:File process.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/process.json.
INFO:hca:File protocol.json: Retrieving...
INFO:hca:File protocol.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/protocol.json.
INFO:hca:File links.json: Retrieving...
INFO:hca:File links.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/links.json.
{}
[czipa1osx186 (.venv)]:~/hca/data-store-cli:tonytung-retry> 

@ttung
Copy link
Copy Markdown
Member Author

ttung commented Apr 2, 2018

maxing out on attempts.

[czipa1osx186 (.venv)]:~/hca/data-store-cli:tonytung-retry> hca dss download --bundle-uuid cc93131b-c616-4a4c-93dc-85436fc5f98e --replica aws
INFO:hca:File project.json: Retrieving...
INFO:hca:File project.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/project.json.
INFO:hca:File biomaterial.json: Retrieving...
INFO:hca:File biomaterial.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/biomaterial.json.
INFO:hca:File 22011_1#54_1.fastq.gz: Retrieving...
INFO:hca:File 22011_1#54_1.fastq.gz: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/22011_1#54_1.fastq.gz.
INFO:hca:File 22011_1#54_2.fastq.gz: Retrieving...
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
Traceback (most recent call last):
  File "/Users/ttung/hca/data-store-cli/.venv/bin/hca", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/Users/ttung/hca/data-store-cli/scripts/hca", line 13, in <module>
    cli.main()
  File "/Users/ttung/hca/data-store-cli/hca/cli.py", line 65, in main
    result = parsed_args.entry_point(parsed_args)
  File "/Users/ttung/hca/data-store-cli/hca/util/__init__.py", line 404, in arg_forwarder
    return command(**command_args)
  File "/Users/ttung/hca/data-store-cli/hca/dss/__init__.py", line 55, in download
    'Range': "bytes={}-".format(fh.tell())
  File "/Users/ttung/hca/data-store-cli/hca/util/__init__.py", line 141, in _request
    timeout=1.0,
  File "/Users/ttung/hca/data-store-cli/.venv/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/ttung/hca/data-store-cli/.venv/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/Users/ttung/hca/data-store-cli/.venv/lib/python2.7/site-packages/requests/adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='dss.data.humancellatlas.org', port=443): Max retries exceeded with url: /v1/files/93f4c9fd-c0e1-4294-8083-3e214f0e202a?replica=aws (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x103a69a10>: Failed to establish a new connection: [Errno 51] Network is unreachable',))
[czipa1osx186 (.venv)]:~/hca/data-store-cli:tonytung-retry> 

@ttung ttung requested a review from kislyuk April 2, 2018 20:49
@codecov-io
Copy link
Copy Markdown

codecov-io commented Apr 2, 2018

Codecov Report

Merging #101 into tonytung-break will decrease coverage by 1.36%.
The diff coverage is 61.4%.

Impacted file tree graph

@@                Coverage Diff                 @@
##           tonytung-break     #101      +/-   ##
==================================================
- Coverage           87.93%   86.56%   -1.37%     
==================================================
  Files                  29       29              
  Lines                 978     1020      +42     
==================================================
+ Hits                  860      883      +23     
- Misses                118      137      +19
Impacted Files Coverage Δ
hca/util/__init__.py 91.69% <100%> (ø) ⬆️
hca/dss/__init__.py 78.57% <60.71%> (-11.91%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f99e8a8...2f71ade. Read the comment docs.

Comment thread hca/dss/__init__.py

for chunk in response.iter_content(chunk_size=1024*1024):
if chunk:
fh.write(chunk)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use checksummingio.ChecksummingSink to compute the checksum as you download.

@ttung ttung force-pushed the tonytung-retry branch 4 times, most recently from e129216 to b388f1e Compare April 25, 2018 18:32
@ttung
Copy link
Copy Markdown
Member Author

ttung commented Apr 25, 2018

Modified to use checksumming_io.
Resume at an unexpected location by dropping bytes.

@ttung ttung changed the base branch from master to tonytung-url April 25, 2018 18:58
@ttung ttung changed the base branch from tonytung-url to tonytung-break April 25, 2018 19:58
@ttung ttung force-pushed the tonytung-retry branch 2 times, most recently from ac2fc54 to aaae6f3 Compare April 25, 2018 23:56
@ttung ttung mentioned this pull request Apr 30, 2018
@kislyuk
Copy link
Copy Markdown
Member

kislyuk commented Apr 30, 2018

Is there a way to get the checksummer to only compute sha256? I'm a bit concerned about the extra overhead of the other checksums.

Comment thread hca/dss/__init__.py

while consume_bytes > 0:
bytes_to_read = min(consume_bytes, 1024*1024)
content = response.iter_content(chunk_size=bytes_to_read)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you use response.iter_content instead of response.raw.read?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a documented API. response.raw.read is not.

Also, the implementation of response.iter_content seems to branch on whether the backend is urllib3 or not, and in the case of urllib3, it doesn't even call response.raw.read.

Copy link
Copy Markdown
Member

@kislyuk kislyuk May 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm ok. I think it is documented (http://docs.python-requests.org/en/master/api/#requests.Response.raw and http://urllib3.readthedocs.io/en/latest/reference/index.html#urllib3.response.HTTPResponse.read) and I'm vaguely unhappy with the extra indirection happening here for no obvious benefit, but I guess it's OK.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as an example, iter_content(..) in requests calls raw.stream(decode_content=True) for urllib3. If I just call read(..), it will not call with decode_content=True.

This allows us to do ranged gets rather than restart the entire download.

Verify the resulting download for extra correctness.

Test plan: Start a download, yank the ethernet cable, plug it back in, and watch the download succeed.
@ttung ttung force-pushed the tonytung-retry branch from aaae6f3 to 2f71ade Compare May 1, 2018 15:13
@ttung
Copy link
Copy Markdown
Member Author

ttung commented May 1, 2018

Is there a way to get the checksummer to only compute sha256? I'm a bit concerned about the extra overhead of the other checksums.

updated to just use hashlib.sha256()

Copy link
Copy Markdown
Member

@kislyuk kislyuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. For reference, the retry management logic is also available via https://github.com/shazow/urllib3/blob/master/urllib3/util/retry.py, for example it's being used here: https://github.com/HumanCellAtlas/dcp-cli/blob/master/hca/dss/__init__.py#L45-L58 - you may want to consider using that instead of rolling your own.

@ttung
Copy link
Copy Markdown
Member Author

ttung commented May 1, 2018

LGTM. For reference, the retry management logic is also available via https://github.com/shazow/urllib3/blob/master/urllib3/util/retry.py, for example it's being used here: https://github.com/HumanCellAtlas/dcp-cli/blob/master/hca/dss/__init__.py#L45-L58 - you may want to consider using that instead of rolling your own.

I tried it, but in my crude experimentation with Network Line Conditioner, the number of retries can quickly be exhausted and the transfer fails. My approach is more like TCP, where some data making it through increases forgiveness.

@ttung ttung changed the base branch from tonytung-break to master May 1, 2018 19:51
@ttung ttung merged commit 6127091 into master May 1, 2018
@ttung ttung deleted the tonytung-retry branch May 1, 2018 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants