Add HTTP resume to DCP CLI download. by ttung · Pull Request #101 · HumanCellAtlas/dcp-cli

ttung · 2018-04-02T20:32:52Z

This allows us to resume with ranged GETs rather than restarting the entire download.

The resulting download is verified for extra correctness.

Test plan: Start a download, yank the ethernet cable, plug it back in, and watch the download succeed.
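For readers unfamiliar with ranged GETs, here is a minimal sketch of the idea using only the Python standard library rather than the CLI's own request code. The function names and the fallback to rewriting the file on a plain 200 response are assumptions for illustration; only the `bytes={}-` header shape appears in the traceback later in this thread.

```python
import os
import urllib.request


def make_resume_request(url, offset):
    """Build a GET that asks the server for bytes from `offset` to the end."""
    request = urllib.request.Request(url)
    if offset > 0:
        # An open-ended range: everything from `offset` onward.
        request.add_header("Range", "bytes={}-".format(offset))
    return request


def resume_download(url, path, chunk_size=1024 * 1024):
    """Append the missing tail of `url` to `path` (hypothetical helper)."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    with urllib.request.urlopen(make_resume_request(url, offset)) as response:
        # 206 means the server honored the Range header; on a plain 200 the
        # body is the whole file, so this sketch simply starts over.
        mode = "ab" if response.status == 206 else "wb"
        with open(path, mode) as fh:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                fh.write(chunk)
```

Note that a file that is already complete would make the server answer 416, which `urlopen` raises as an `HTTPError`; a real implementation would need to handle that case.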

ttung · 2018-04-02T20:33:30Z

Successful retry & resume.

[czipa1osx186 (.venv)]:~/hca/data-store-cli:tonytung-retry> hca dss download --bundle-uuid cc93131b-c616-4a4c-93dc-85436fc5f98e --replica aws
INFO:hca:File project.json: Retrieving...
INFO:hca:File project.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/project.json.
INFO:hca:File biomaterial.json: Retrieving...
INFO:hca:File biomaterial.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/biomaterial.json.
INFO:hca:File 22011_1#54_1.fastq.gz: Retrieving...
INFO:hca:File 22011_1#54_1.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_1.fastq.gz: Resuming at 9437184.
INFO:hca:File 22011_1#54_1.fastq.gz: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/22011_1#54_1.fastq.gz.
INFO:hca:File 22011_1#54_2.fastq.gz: Retrieving...
INFO:hca:File 22011_1#54_2.fastq.gz: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/22011_1#54_2.fastq.gz.
INFO:hca:File file.json: Retrieving...
INFO:hca:File file.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/file.json.
INFO:hca:File process.json: Retrieving...
INFO:hca:File process.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/process.json.
INFO:hca:File protocol.json: Retrieving...
INFO:hca:File protocol.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/protocol.json.
INFO:hca:File links.json: Retrieving...
INFO:hca:File links.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/links.json.
{}
[czipa1osx186 (.venv)]:~/hca/data-store-cli:tonytung-retry>

ttung · 2018-04-02T20:41:15Z

Maxing out on retry attempts.

[czipa1osx186 (.venv)]:~/hca/data-store-cli:tonytung-retry> hca dss download --bundle-uuid cc93131b-c616-4a4c-93dc-85436fc5f98e --replica aws
INFO:hca:File project.json: Retrieving...
INFO:hca:File project.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/project.json.
INFO:hca:File biomaterial.json: Retrieving...
INFO:hca:File biomaterial.json: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/biomaterial.json.
INFO:hca:File 22011_1#54_1.fastq.gz: Retrieving...
INFO:hca:File 22011_1#54_1.fastq.gz: GET SUCCEEDED. Stored at cc93131b-c616-4a4c-93dc-85436fc5f98e/22011_1#54_1.fastq.gz.
INFO:hca:File 22011_1#54_2.fastq.gz: Retrieving...
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
INFO:hca:File 22011_1#54_2.fastq.gz: GET FAILED. Attempting to resume.
Traceback (most recent call last):
  File "/Users/ttung/hca/data-store-cli/.venv/bin/hca", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/Users/ttung/hca/data-store-cli/scripts/hca", line 13, in <module>
    cli.main()
  File "/Users/ttung/hca/data-store-cli/hca/cli.py", line 65, in main
    result = parsed_args.entry_point(parsed_args)
  File "/Users/ttung/hca/data-store-cli/hca/util/__init__.py", line 404, in arg_forwarder
    return command(**command_args)
  File "/Users/ttung/hca/data-store-cli/hca/dss/__init__.py", line 55, in download
    'Range': "bytes={}-".format(fh.tell())
  File "/Users/ttung/hca/data-store-cli/hca/util/__init__.py", line 141, in _request
    timeout=1.0,
  File "/Users/ttung/hca/data-store-cli/.venv/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/ttung/hca/data-store-cli/.venv/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/Users/ttung/hca/data-store-cli/.venv/lib/python2.7/site-packages/requests/adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='dss.data.humancellatlas.org', port=443): Max retries exceeded with url: /v1/files/93f4c9fd-c0e1-4294-8083-3e214f0e202a?replica=aws (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x103a69a10>: Failed to establish a new connection: [Errno 51] Network is unreachable',))
[czipa1osx186 (.venv)]:~/hca/data-store-cli:tonytung-retry>

codecov-io · 2018-04-02T21:17:44Z

Codecov Report

Merging #101 into tonytung-break will decrease coverage by 1.36%.
The diff coverage is 61.4%.

@@                Coverage Diff                 @@
##           tonytung-break     #101      +/-   ##
==================================================
- Coverage           87.93%   86.56%   -1.37%     
==================================================
  Files                  29       29              
  Lines                 978     1020      +42     
==================================================
+ Hits                  860      883      +23     
- Misses                118      137      +19

| Impacted Files | Coverage Δ | |
|---|---|---|
| hca/util/__init__.py | `91.69% <100%> (ø)` | ⬆️ |
| hca/dss/__init__.py | `78.57% <60.71%> (-11.91%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Bento007 · 2018-04-16T16:16:04Z

+
+                            for chunk in response.iter_content(chunk_size=1024*1024):
+                                if chunk:
+                                    fh.write(chunk)


Use checksumming_io.ChecksummingSink to compute the checksum as you download.

ttung · 2018-04-25T18:32:40Z

Modified to use checksumming_io.
Also handles a resume that starts at an unexpected location by dropping bytes.
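One way to picture "dropping bytes": if the response starts earlier than the local file's current size, the overlapping prefix is read and discarded before writing resumes. The sketch below is an assumption-laden illustration, not the PR's code; `bytes_to_drop` and `drop_bytes` are hypothetical names, and it assumes the server reports its starting offset in a `Content-Range` header of the form `bytes START-END/TOTAL`.

```python
import io
import re

CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)")


def bytes_to_drop(local_size, content_range_header):
    """Return how many leading response bytes are already on disk."""
    server_start = 0  # no Content-Range header: assume the body starts at byte 0
    if content_range_header is not None:
        match = CONTENT_RANGE.search(content_range_header)
        if match is not None:
            server_start = int(match.group(1))
    if server_start > local_size:
        raise ValueError("server skipped ahead of the local file; cannot resume")
    return local_size - server_start


def drop_bytes(stream, count, chunk_size=1024 * 1024):
    """Read and discard `count` bytes from a file-like `stream`."""
    while count > 0:
        chunk = stream.read(min(count, chunk_size))
        if not chunk:
            raise EOFError("response ended before reaching the resume offset")
        count -= len(chunk)
```

After `drop_bytes` returns, the next byte read from the stream lines up with the end of the local file, so the normal write loop can continue.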

kislyuk · 2018-04-30T23:54:27Z

Is there a way to get the checksummer to only compute sha256? I'm a bit concerned about the extra overhead of the other checksums.

kislyuk · 2018-05-01T00:00:57Z

+
+                                while consume_bytes > 0:
+                                    bytes_to_read = min(consume_bytes, 1024*1024)
+                                    content = response.iter_content(chunk_size=bytes_to_read)


Is there a reason why you use response.iter_content instead of response.raw.read?

ttung · 2018-05-01

It's a documented API. response.raw.read is not.

Also, the implementation of response.iter_content seems to branch on whether the backend is urllib3 or not, and in the case of urllib3, it doesn't even call response.raw.read.

kislyuk · 2018-05-01

Hmm, OK. I think it is documented (http://docs.python-requests.org/en/master/api/#requests.Response.raw and http://urllib3.readthedocs.io/en/latest/reference/index.html#urllib3.response.HTTPResponse.read) and I'm vaguely unhappy with the extra indirection happening here for no obvious benefit, but I guess it's OK.

ttung · 2018-05-01

So, as an example, iter_content(..) in requests calls raw.stream(decode_content=True) for urllib3. If I just call read(..), it will not be called with decode_content=True.
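To make the `decode_content` point concrete: when a body is sent with a content encoding such as gzip, the raw bytes on the wire differ from the decoded payload, so reading raw bytes without decoding would store the wrong data. The sketch below is a standard-library stand-in for that difference, not the code inside requests or urllib3; `iter_decoded` is a hypothetical helper.

```python
import gzip
import zlib


def iter_decoded(raw_chunks):
    """Yield decoded bytes from an iterable of gzip-compressed chunks."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in raw_chunks:
        data = decoder.decompress(chunk)
        if data:
            yield data
    tail = decoder.flush()
    if tail:
        yield tail


payload = b"ACGT" * 1000
wire_bytes = gzip.compress(payload)  # what an undecoded read would see
raw_chunks = [wire_bytes[i:i + 16] for i in range(0, len(wire_bytes), 16)]
decoded = b"".join(iter_decoded(raw_chunks))  # what a decoding iterator yields
```

Here `decoded` equals `payload`, while the concatenated raw chunks are the compressed bytes.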

ttung · 2018-05-01T15:14:01Z

> Is there a way to get the checksummer to only compute sha256? I'm a bit concerned about the extra overhead of the other checksums.

Updated to just use hashlib.sha256().
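A minimal sketch of hashing with `hashlib.sha256()` while the chunks are written, so the file does not need a second pass. The helper name and the optional `hasher` argument (to carry the running hash across a resumed request) are assumptions for illustration, not the PR's actual code.

```python
import hashlib
import io


def write_and_hash(chunks, fh, hasher=None):
    """Write each chunk to `fh` and feed it to a running SHA-256 hash."""
    if hasher is None:
        hasher = hashlib.sha256()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            fh.write(chunk)
            hasher.update(chunk)
    return hasher


buffer = io.BytesIO()
# First request delivers part of the file, then the connection drops.
hasher = write_and_hash([b"hello ", b""], buffer)
# The resumed request continues with the same hasher.
hasher = write_and_hash([b"world"], buffer, hasher)
digest = hasher.hexdigest()
```

The resulting `digest` can then be compared against the expected SHA-256 for the file, assuming one is available from the service.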

kislyuk

LGTM. For reference, the retry management logic is also available via https://github.com/shazow/urllib3/blob/master/urllib3/util/retry.py, for example it's being used here: https://github.com/HumanCellAtlas/dcp-cli/blob/master/hca/dss/__init__.py#L45-L58 - you may want to consider using that instead of rolling your own.

ttung · 2018-05-01T19:50:57Z

> LGTM. For reference, the retry management logic is also available via https://github.com/shazow/urllib3/blob/master/urllib3/util/retry.py, for example it's being used here: https://github.com/HumanCellAtlas/dcp-cli/blob/master/hca/dss/__init__.py#L45-L58 - you may want to consider using that instead of rolling your own.

I tried it, but in my crude experimentation with Network Link Conditioner, the number of retries can quickly be exhausted and the transfer fails. My approach is more like TCP, where some data making it through increases forgiveness.
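The "forgiveness" idea can be sketched as a retry budget that only counts attempts that made no progress. The exact policy used in the PR is not spelled out in this thread, so resetting the counter whenever any bytes arrive is an assumption, as are the function and parameter names below.

```python
def download_with_progress_aware_retries(fetch_from, sink, max_stalled_attempts=10):
    """Fill `sink` (a bytearray) by repeatedly resuming from its current length.

    `fetch_from(offset)` is a hypothetical callable returning an iterable of
    byte chunks starting at `offset`; it may raise OSError mid-stream.
    """
    stalled = 0
    while True:
        offset_before = len(sink)
        try:
            for chunk in fetch_from(offset_before):
                sink.extend(chunk)
            return bytes(sink)
        except OSError:
            if len(sink) > offset_before:
                # Some data made it through: forgive earlier failures.
                stalled = 0
            else:
                stalled += 1
                if stalled >= max_stalled_attempts:
                    raise
```

With a fixed retry count, a long transfer over a lossy link can fail even though every attempt moves it forward; under this policy only consecutive attempts that deliver nothing exhaust the budget.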

ttung force-pushed the tonytung-retry branch from 82e925e to fa3f1eb Compare April 2, 2018 20:40

ttung force-pushed the tonytung-retry branch from fa3f1eb to fe21518 Compare April 2, 2018 20:48

ttung requested a review from kislyuk April 2, 2018 20:49

ttung mentioned this pull request Apr 3, 2018

Support range requests in the CLI #103

Open

Bento007 reviewed Apr 16, 2018

ttung force-pushed the tonytung-retry branch 4 times, most recently from e129216 to b388f1e Compare April 25, 2018 18:32

ttung force-pushed the tonytung-retry branch from b388f1e to cd54143 Compare April 25, 2018 18:58

ttung changed the base branch from master to tonytung-url April 25, 2018 18:58

ttung changed the base branch from tonytung-url to tonytung-break April 25, 2018 19:58

ttung force-pushed the tonytung-retry branch from cd54143 to e5fcfcd Compare April 25, 2018 19:58

ttung force-pushed the tonytung-break branch from 357af3b to f99e8a8 Compare April 25, 2018 23:54

ttung force-pushed the tonytung-retry branch 2 times, most recently from ac2fc54 to aaae6f3 Compare April 25, 2018 23:56

ttung mentioned this pull request Apr 30, 2018

Preview download very slow #115

Closed

kislyuk reviewed May 1, 2018

Add HTTP resume to DCP CLI download.

2f71ade

ttung force-pushed the tonytung-retry branch from aaae6f3 to 2f71ade Compare May 1, 2018 15:13

kislyuk approved these changes May 1, 2018

ttung changed the base branch from tonytung-break to master May 1, 2018 19:51

ttung merged commit 6127091 into master May 1, 2018

ttung deleted the tonytung-retry branch May 1, 2018 19:51
