upload: fix list command to work for areas with many files #93

Merged
sampierson merged 3 commits into master from spierson-uplsrv-list-scaling
Jan 24, 2018
Conversation

@sampierson
Member

@sampierson sampierson commented Jan 22, 2018

Instead of getting the file list through the Upload API, get the file list directly from S3. This means regular listings are very fast and we avoid proxied pagination.

If we are doing long listings, get detail on batches of files from the Upload API, so we don't have to give users read access to files. Batch 100 files at a time, so the UI is fairly responsive (and we don't blow through API Gateway timeouts).
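For illustration only (this helper is hypothetical, not the actual hca code), the batching idea can be sketched as: list the keys straight from S3, then request details from the Upload API in slices of 100 so each call finishes well inside the API Gateway timeout.

```python
def batch(file_list, size=100):
    """Yield successive slices of at most `size` files; each slice becomes
    one Upload API detail request, keeping individual calls small."""
    for i in range(0, len(file_list), size):
        yield file_list[i:i + size]

# Hypothetical usage: in practice the keys would come from an S3 listing.
s3_keys = ["file{}.fastq.gz".format(n) for n in range(250)]
batches = list(batch(s3_keys))  # three slices: 100, 100, and 50 keys
```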

Instead of getting the file list from the Upload API, get the file list
directly from S3, then call the Upload API to get detail on files if
required. This way we avoid proxied pagination.
@sampierson sampierson requested review from kislyuk and ttung January 22, 2018 23:59
@ghost ghost assigned sampierson Jan 22, 2018
@ghost ghost added the code review label Jan 22, 2018
@codecov-io

codecov-io commented Jan 23, 2018

Codecov Report

Merging #93 into master will increase coverage by 0.04%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #93      +/-   ##
==========================================
+ Coverage   87.91%   87.95%   +0.04%     
==========================================
  Files          28       28              
  Lines         935      955      +20     
==========================================
+ Hits          822      840      +18     
- Misses        113      115       +2
Impacted Files Coverage Δ
hca/upload/cli/list_area_command.py 100% <100%> (ø) ⬆️
hca/upload/s3_agent.py 71.42% <100%> (+3.24%) ⬆️
hca/upload/upload_area.py 98.24% <100%> (+0.37%) ⬆️
hca/upload/upload_area_urn.py 90.9% <50%> (-4.1%) ⬇️
hca/upload/__init__.py 93.33% <66.66%> (-6.67%) ⬇️
hca/upload/api_client.py 91.66% <85.71%> (-8.34%) ⬇️
hca/upload/upload_config.py 100% <0%> (+5.71%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 739d7bf...eb52706. Read the comment docs.

@sampierson sampierson force-pushed the spierson-uplsrv-list-scaling branch from 80f9b9b to 46fd9e9 on January 24, 2018 18:23
and save it in project.
@sampierson sampierson force-pushed the spierson-uplsrv-list-scaling branch from 46fd9e9 to c9bb2d5 on January 24, 2018 18:27
Member

@ttung ttung left a comment

Most of this looks fine. Just a question about the content type.

Comment thread hca/upload/api_client.py
return response.json()['files']
def files_info(self, area_uuid, file_list):
url = "{api_url_base}/area/{uuid}/files_info".format(api_url_base=self.api_url_base, uuid=area_uuid)
response = requests.put(url, data=(json.dumps(file_list)))
Member

I assume this is because GET has a lower payload size limit than PUT? Kind of unfortunate... :(

Member Author

I was originally using GET, but I discovered that the AWS API Gateway throws away your payload for a GET. The HTTP spec also says that a server's response to a GET should not vary based on the payload, and as Andrey pointed out, if it does, that breaks caching. So in this case we have to abandon RESTfulness and just call it an RPC call :)
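A hedged sketch of the resulting RPC-style call (the URL is made up; only the shape of the request matters): the file list rides in the body of a PUT, which API Gateway preserves, rather than in the body of a GET, which it discards. Building the request with `requests.Request(...).prepare()` shows the shape without sending anything.

```python
import json
import requests

file_list = ["file1.fastq.gz", "file2.fastq.gz"]
url = "https://upload.example.org/v1/area/deadbeef/files_info"  # hypothetical URL

# Build (but do not send) the request: the JSON-encoded file list travels
# in the PUT body, which survives API Gateway, unlike a GET payload.
prepared = requests.Request("PUT", url, data=json.dumps(file_list)).prepare()
```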

Comment thread hca/upload/api_client.py Outdated
response = requests.put(url, data=(json.dumps(file_list)))
if not response.ok:
raise RuntimeError("GET {url} returned {status}, {content}".format(url=url,
status=response.status_code,
Member

recommend putting the first parameter on the next line, since that ensures you don't need to re-indent all the lines if you change the first line in the future.
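For example, a hypothetical helper showing the suggested layout: with the first argument pushed to its own line, editing the first line (the format string) never forces re-indenting the continuation lines.

```python
def error_message(url, status, content):
    # First format argument starts on the next line, so the continuation
    # lines are indented relative to the call, not to the first line's length.
    return "PUT {url} returned {status}, {content}".format(
        url=url,
        status=status,
        content=content)
```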

Member Author

Good idea. Will do.

Comment thread hca/upload/s3_agent.py

def list_bucket_by_page(self, bucket_name, key_prefix):
paginator = self.s3.meta.client.get_paginator('list_objects')
for page in paginator.paginate(Bucket=bucket_name, Prefix=key_prefix, PaginationConfig={'PageSize': 100}):
Member

is there a reason that the default page size is not sufficient?

Member Author

Yes. The default page size is 1000. Attempting to get file info on 1000 S3 objects from the Upload Service lambda took a considerable amount of time and was timing out at the API Gateway (30s) and sometimes hitting the Lambda timeout too (300s). It also led to very long pauses in what should be a moderately interactive command. Reducing to 100 keeps it within the API Gateway timeout and makes the command feel more interactive.

I considered getting 1000 at a time from S3, then looping through that 100 at a time to get file info from the Upload Service, but frankly I don't think that optimization is worth the extra code complexity. 99% of the work is talking to the Upload Service anyway.

Member

Cool, this makes sense, and thanks for the explanation.

Comment thread test/upload/cli/test_list_area.py Outdated
m.put(mock_url, text='['
'{"name":"file1.fastq.gz",'
'"content_type":"foo/bar",'
'"content_type":"binary/octet-stream dcp-type=data",'
Member

I thought your format is binary/octet-stream; dcp-type=data (note the semi-colon)

Member Author

You are correct. Will fix.

@sampierson
Member Author

Thank you for the code review!

Member

@ttung ttung left a comment

see comment. rest lgtm.

Comment thread test/upload/cli/test_list_area.py Outdated

self.assertRegexpMatches(stdout.captured(), "size\s+123")
self.assertRegexpMatches(stdout.captured(), "Content-Type\s+foo/bar")
self.assertRegexpMatches(stdout.captured(), "Content-Type\s+binary/octet-stream dcp-type=data")
Member

i assume this also needs the semicolon.

@sampierson sampierson force-pushed the spierson-uplsrv-list-scaling branch from ac67055 to eb52706 on January 24, 2018 23:41
@sampierson sampierson merged commit ab1a386 into master Jan 24, 2018
@ghost ghost removed the code review label Jan 24, 2018
@sampierson sampierson deleted the spierson-uplsrv-list-scaling branch January 24, 2018 23:57