upload: fix list command to work for areas with many files #93
sampierson merged 3 commits into master
Conversation
Instead of getting the file list from the Upload API, get the file list directly from S3, then call the Upload API for detail on files if required. This way we avoid proxied pagination.
Codecov Report
@@            Coverage Diff             @@
##           master      #93      +/-   ##
==========================================
+ Coverage   87.91%   87.95%   +0.04%
==========================================
  Files          28       28
  Lines         935      955      +20
==========================================
+ Hits          822      840      +18
- Misses        113      115       +2
Continue to review full report at Codecov.
Force-pushed 80f9b9b to 46fd9e9
and save it in project.
Force-pushed 46fd9e9 to c9bb2d5
ttung left a comment
Most of this looks fine. Just a question about the content type.
        return response.json()['files']

    def files_info(self, area_uuid, file_list):
        url = "{api_url_base}/area/{uuid}/files_info".format(api_url_base=self.api_url_base, uuid=area_uuid)
        response = requests.put(url, data=(json.dumps(file_list)))
I assume this is because GET has a lower payload size limit than PUT? Kind of unfortunate... :(
I was originally using GET, but I discovered that AWS API Gateway throws away your payload for a GET. The HTTP spec says that a server's response to a GET should not vary based on the payload, and as Andrey pointed out, if it does, that breaks caching. So in this case we have to abandon RESTfulness and just call it an RPC call :)
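A minimal sketch of the resulting "RPC over PUT" call, following the files_info() method quoted in this review (standalone function form; error handling simplified):

    import json
    import requests

    def files_info(api_url_base, area_uuid, file_list):
        url = "{base}/area/{uuid}/files_info".format(base=api_url_base, uuid=area_uuid)
        # The key list travels in the request body; API Gateway discards the
        # body of a GET, so this read-only call is made as a PUT instead.
        response = requests.put(url, data=json.dumps(file_list))
        if not response.ok:
            raise RuntimeError("PUT {url} returned {status}".format(
                url=url, status=response.status_code))
        return response.json()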
        response = requests.put(url, data=(json.dumps(file_list)))
        if not response.ok:
            raise RuntimeError("PUT {url} returned {status}, {content}".format(url=url,
                                                                               status=response.status_code,
recommend putting the first parameter on the next line, since that ensures you don't need to re-indent all the lines if you change the first line in the future.
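For example, a sketch of the suggested layout (reusing names from the quoted diff, wrapped in a hypothetical helper):

    def raise_for_failure(url, response):
        # With the first argument on its own line, editing "raise RuntimeError("
        # never forces re-indenting the continuation lines below it.
        raise RuntimeError(
            "PUT {url} returned {status}, {content}".format(
                url=url,
                status=response.status_code,
                content=response.content))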
    def list_bucket_by_page(self, bucket_name, key_prefix):
        paginator = self.s3.meta.client.get_paginator('list_objects')
        for page in paginator.paginate(Bucket=bucket_name, Prefix=key_prefix, PaginationConfig={'PageSize': 100}):
is there a reason that the default page size is not sufficient?
Yes. The default page size is 1000. Attempting to get file info on 1000 S3 objects from the Upload Service lambda took a considerable amount of time, timing out at the API gateway (30s) and sometimes hitting the Lambda timeout (300s) as well. It also led to very long pauses in what should be a moderately interactive command. Reducing to 100 keeps it within the API gateway timeout and makes the command feel more interactive.
I considered getting 1000 at a time from S3, then looping through that 100 at a time to get file info from the Upload Service, but frankly I don't think that optimization is worth the extra code complexity. 99% of the work is talking to the Upload Service anyway.
Cool, this makes sense, and thanks for the explanation.
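A sketch of the page-size choice under discussion, using boto3's list_objects paginator (bucket and prefix names here are hypothetical):

    import boto3

    s3 = boto3.resource('s3')
    paginator = s3.meta.client.get_paginator('list_objects')
    # 100 keys per page: each page becomes a single files_info() call, which
    # stays comfortably inside API Gateway's 30-second timeout.
    for page in paginator.paginate(Bucket='upload-bucket',   # hypothetical
                                   Prefix='area-uuid/',      # hypothetical
                                   PaginationConfig={'PageSize': 100}):
        keys = [obj['Key'] for obj in page.get('Contents', [])]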
        m.put(mock_url, text='['
                             '{"name":"file1.fastq.gz",'
                             '"content_type":"foo/bar",'
                             '"content_type":"binary/octet-stream dcp-type=data",'
I thought your format is binary/octet-stream; dcp-type=data (note the semi-colon)
You are correct. Will fix.
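For reference, the corrected fixture line would presumably read:

    '"content_type":"binary/octet-stream; dcp-type=data",'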
Thank you for the code review!
        self.assertRegexpMatches(stdout.captured(), "size\s+123")
        self.assertRegexpMatches(stdout.captured(), "Content-Type\s+foo/bar")
        self.assertRegexpMatches(stdout.captured(), "Content-Type\s+binary/octet-stream dcp-type=data")
i assume this also needs the semicolon.
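Presumably the corrected assertion would then read:

    self.assertRegexpMatches(stdout.captured(), "Content-Type\s+binary/octet-stream; dcp-type=data")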
Force-pushed ac67055 to eb52706
Instead of getting the file list through the Upload API, get file list directly from S3. This means regular listings are very fast and we avoid proxied pagination.
If we are doing long listings, get detail on batches of files from the Upload API, so we don't have to give users read access to files. Batch 100 files at a time, so the UI is fairly responsive (and we don't blow API Gateway timeouts).
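A minimal sketch of that flow (function and parameter names here are assumptions, not the PR's exact code):

    def list_area_files(upload_api, s3_key_pages, area_uuid, long_listing=False):
        # s3_key_pages: an iterable of pages of <= 100 S3 keys, as produced by
        # list_bucket_by_page() above.
        for page in s3_key_pages:
            if long_listing:
                # One Upload API call per batch of 100 keys, so users never
                # need direct read access to the S3 objects themselves.
                yield from upload_api.files_info(area_uuid, page)
            else:
                # Fast path: the S3 key listing alone is enough.
                yield from page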