upload: fix list command to work for areas with many files #93

Merged
sampierson merged 3 commits into master from spierson-uplsrv-list-scaling
Jan 24, 2018
Conversation

@sampierson
Member

@sampierson sampierson commented Jan 22, 2018

Instead of getting the file list through the Upload API, get the file list directly from S3. This means regular listings are very fast and we avoid proxied pagination.

If we are doing long listings, get detail on batches of files from the Upload API, so we don't have to give users read access to files. Batch 100 files at a time, so the UI is fairly responsive (and we don't blow through API Gateway timeouts).
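For illustration only (this helper is hypothetical, not the actual hca code), the batching idea can be sketched as: list the keys straight from S3, then request details from the Upload API in slices of 100 so each call finishes well inside the API Gateway timeout.

```python
def batch(file_list, size=100):
    """Yield successive slices of at most `size` files; each slice becomes
    one Upload API detail request, keeping individual calls small."""
    for i in range(0, len(file_list), size):
        yield file_list[i:i + size]

# Hypothetical usage: in practice the keys would come from an S3 listing.
s3_keys = ["file{}.fastq.gz".format(n) for n in range(250)]
batches = list(batch(s3_keys))  # three slices: 100, 100, and 50 keys
```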

Instead of getting the file list from the Upload API, get the file list
directly from S3, then call the Upload API to get detail on files if
required. This way we avoid proxied pagination.
@sampierson sampierson requested review from kislyuk and ttung January 22, 2018 23:59
@ghost ghost assigned sampierson Jan 22, 2018
@ghost ghost added the code review label Jan 22, 2018
@codecov-io

codecov-io commented Jan 23, 2018

Codecov Report

Merging #93 into master will increase coverage by 0.04%.
The diff coverage is 88.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #93      +/-   ##
==========================================
+ Coverage   87.91%   87.95%   +0.04%     
==========================================
  Files          28       28              
  Lines         935      955      +20     
==========================================
+ Hits          822      840      +18     
- Misses        113      115       +2
Impacted Files Coverage Δ
hca/upload/cli/list_area_command.py 100% <100%> (ø) ⬆️
hca/upload/s3_agent.py 71.42% <100%> (+3.24%) ⬆️
hca/upload/upload_area.py 98.24% <100%> (+0.37%) ⬆️
hca/upload/upload_area_urn.py 90.9% <50%> (-4.1%) ⬇️
hca/upload/__init__.py 93.33% <66.66%> (-6.67%) ⬇️
hca/upload/api_client.py 91.66% <85.71%> (-8.34%) ⬇️
hca/upload/upload_config.py 100% <0%> (+5.71%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 739d7bf...eb52706. Read the comment docs.

@sampierson sampierson force-pushed the spierson-uplsrv-list-scaling branch from 80f9b9b to 46fd9e9 on January 24, 2018 18:23
and save it in project.
@sampierson sampierson force-pushed the spierson-uplsrv-list-scaling branch from 46fd9e9 to c9bb2d5 on January 24, 2018 18:27
Member

@ttung ttung left a comment

Most of this looks fine. Just a question about the content type.

Comment thread hca/upload/api_client.py
return response.json()['files']
def files_info(self, area_uuid, file_list):
url = "{api_url_base}/area/{uuid}/files_info".format(api_url_base=self.api_url_base, uuid=area_uuid)
response = requests.put(url, data=(json.dumps(file_list)))
Member

I assume this is because GET has a lower payload size limit than PUT? Kind of unfortunate... :(

Member Author

I was originally using GET, but I discovered that the AWS API Gateway throws away your payload for a GET. The HTTP spec also says that a server's response to a GET should not vary based on the payload, and as Andrey pointed out, if it does, that breaks caching. So in this case we have to abandon RESTfulness and just call it an RPC call :)
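A hedged sketch of the resulting RPC-style call (the URL is made up; only the shape of the request matters): the file list rides in the body of a PUT, which API Gateway preserves, rather than in the body of a GET, which it discards. Building the request with `requests.Request(...).prepare()` shows the shape without sending anything.

```python
import json
import requests

file_list = ["file1.fastq.gz", "file2.fastq.gz"]
url = "https://upload.example.org/v1/area/deadbeef/files_info"  # hypothetical URL

# Build (but do not send) the request: the JSON-encoded file list travels
# in the PUT body, which survives API Gateway, unlike a GET payload.
prepared = requests.Request("PUT", url, data=json.dumps(file_list)).prepare()
```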

Comment thread hca/upload/api_client.py Outdated
response = requests.put(url, data=(json.dumps(file_list)))
if not response.ok:
raise RuntimeError("GET {url} returned {status}, {content}".format(url=url,
status=response.status_code,
Member

recommend putting the first parameter on the next line, since that ensures you don't need to re-indent all the lines if you change the first line in the future.
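For example, a hypothetical helper showing the suggested layout: with the first argument pushed to its own line, editing the first line (the format string) never forces re-indenting the continuation lines.

```python
def error_message(url, status, content):
    # First format argument starts on the next line, so the continuation
    # lines are indented relative to the call, not to the first line's length.
    return "PUT {url} returned {status}, {content}".format(
        url=url,
        status=status,
        content=content)
```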

Member Author

Good idea. Will do.

Comment thread hca/upload/s3_agent.py

def list_bucket_by_page(self, bucket_name, key_prefix):
paginator = self.s3.meta.client.get_paginator('list_objects')
for page in paginator.paginate(Bucket=bucket_name, Prefix=key_prefix, PaginationConfig={'PageSize': 100}):
Member

is there a reason that the default page size is not sufficient?

Member Author

Yes. The default page size is 1000. Attempting to get file info on 1000 S3 objects from the Upload Service lambda took a considerable amount of time and was timing out at the API Gateway (30s) and sometimes hitting the Lambda timeout too (300s). It also led to very long pauses in what should be a moderately interactive command. Reducing to 100 keeps it within the API Gateway timeout and makes the command feel more interactive.

I considered getting 1000 at a time from S3, then looping through that 100 at a time to get file info from the Upload Service, but frankly I don't think that optimization is worth the extra code complexity. 99% of the work is talking to the Upload Service anyway.

Member

Cool, this makes sense, and thanks for the explanation.

Comment thread test/upload/cli/test_list_area.py Outdated
m.put(mock_url, text='['
'{"name":"file1.fastq.gz",'
'"content_type":"foo/bar",'
'"content_type":"binary/octet-stream dcp-type=data",'
Member

I thought your format is binary/octet-stream; dcp-type=data (note the semi-colon)

Member Author

You are correct. Will fix.

@sampierson
Member Author

Thank you for the code review!

Member

@ttung ttung left a comment

see comment. rest lgtm.

Comment thread test/upload/cli/test_list_area.py Outdated

self.assertRegexpMatches(stdout.captured(), "size\s+123")
self.assertRegexpMatches(stdout.captured(), "Content-Type\s+foo/bar")
self.assertRegexpMatches(stdout.captured(), "Content-Type\s+binary/octet-stream dcp-type=data")
Member

i assume this also needs the semicolon.

@sampierson sampierson force-pushed the spierson-uplsrv-list-scaling branch from ac67055 to eb52706 on January 24, 2018 23:41
@sampierson sampierson merged commit ab1a386 into master Jan 24, 2018
@ghost ghost removed the code review label Jan 24, 2018
@sampierson sampierson deleted the spierson-uplsrv-list-scaling branch January 24, 2018 23:57