S3 bindings #68
Conversation
Fixed some writing issues. Now done and ready for review. @JelleAalbers

Though I guess we wait for the automatic code review and Travis...
Looks good!
There are indeed no generic I/O tests, though there are tests for DataDirectory + local file storage. It would be useful to generalize these at some point. It may be a bit tricky for S3, since you'd have to set up a mock S3 provider.
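A full mock provider isn't always necessary. A minimal sketch (all names here are hypothetical, not taken from the PR) of an in-memory fake covering only the client calls the backend uses could make generalized I/O tests runnable without credentials:

```python
import io


class FakeS3Client:
    """In-memory stand-in for a boto3 S3 client, implementing only
    the handful of calls the backend needs. Response shapes loosely
    mirror the real API but are trimmed down for testing."""

    def __init__(self):
        self._buckets = {}  # bucket name -> {object key: bytes}

    def create_bucket(self, Bucket):
        self._buckets.setdefault(Bucket, {})

    def list_buckets(self):
        # Real responses also carry 'CreationDate' and 'Owner';
        # only 'Name' is needed here.
        return {'Buckets': [{'Name': n} for n in self._buckets]}

    def put_object(self, Bucket, Key, Body):
        self._buckets[Bucket][Key] = Body

    def get_object(self, Bucket, Key):
        return {'Body': io.BytesIO(self._buckets[Bucket][Key])}
```

Tests could then inject an instance of this in place of `boto3.client('s3')` when constructing the S3 frontend.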
bk = self.backend_key(key_str)

# Get list of all buckets / (run_id, lineage)
objects_list = self.s3client.list_buckets()
I guess this is fine for now, but is there no way to access a bucket by name? The list of all buckets can get rather large. I guess we won't need this once we have a runs db, though.
You use Bucket=self.key later in the Saver's close, so apparently it is possible.
As far as I know, this is the only way with boto3's client API to determine whether a bucket exists. I could go the exception route (try the access and catch the exception), but I try not to use exceptions for normal cases.
Exceptions in normal cases are pythonic (https://docs.quantifiedcode.com/python-anti-patterns/readability/asking_for_permission_instead_of_forgiveness_when_working_with_files.html). But in this case, at least according to https://stackoverflow.com/questions/26871884, you get a generic ClientError which you then have to laboriously parse to make sure you're not catching other errors.
Anyway, it's fine, the run db should take care of locating data in the end.
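For reference, boto3 does offer client.head_bucket(Bucket=...), which on failure raises a ClientError carrying an HTTP status code in its response dict. A sketch of the parsing involved is below; the ClientError class here is a stand-in for botocore.exceptions.ClientError (same .response attribute) so the snippet runs without boto3 installed:

```python
class ClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError; the real class
    also exposes the error details via a .response dict."""

    def __init__(self, error_response, operation_name):
        self.response = error_response
        super().__init__(operation_name)


def bucket_exists(s3client, name):
    """Check a single bucket by name via head_bucket: a 404 means
    'no such bucket'; any other error is re-raised unchanged."""
    try:
        s3client.head_bucket(Bucket=name)
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == '404':
            return False
        raise
```

With the real botocore ClientError imported instead of the stand-in, this avoids listing all buckets just to check for one.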
def _save_chunk(self, data, chunk_info):
    filename = '%06d' % chunk_info['chunk_i']

    with tempfile.SpooledTemporaryFile() as f:
This file-writing code (open a SpooledTemporaryFile, write to it, seek to 0, upload to S3) is repeated three times; maybe consider adding a new method?
I see what you mean, but it's hard to do: one is a download, one uploads JSON, one uploads a chunk. The only line they share is the seek (which may not even be needed...).
There's another JSON upload at the end of close. But if you think avoiding duplication increases complexity too much, just leave it for now.
Good point; within the Saver I can do that.
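The deduplication suggested above could look like the following sketch (the helper name and the callback shapes are assumptions, not code from the PR): the shared spool-write-seek-upload sequence becomes one method, and each call site supplies only its own write and upload steps.

```python
import tempfile


def upload_via_tempfile(write, upload):
    """Spool data to a temporary file, rewind, then hand the open
    file object to an uploader (e.g. s3client.upload_fileobj).
    'write' fills the file; 'upload' consumes it."""
    with tempfile.SpooledTemporaryFile() as f:
        write(f)
        f.seek(0)  # rewind so the uploader reads from the start
        upload(f)
```

Each of the three call sites would then pass only its own write callback, e.g. lambda f: f.write(json.dumps(md).encode()) for the JSON metadata case.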
Once this goes into release, I'll switch processing to using it.
We can now read from S3 storage. Be aware that you need credentials to do this. It uses boto3; see the documentation in the PR for more.
The unique strax key (e.g. 170521_0011_raw_records_58340a130a541c95997fd1c442930427b04eac30) is the bucket name, and the objects within it are chunk filenames. The layout should look very similar to the files I/O routines.

Are there no tests for I/O? I was wondering whether I should mock this.
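Based on the description above, the naming scheme is roughly as follows (a sketch: the key format is taken from the example key above and the '%06d' chunk format from the quoted _save_chunk code; the helper names are hypothetical):

```python
def bucket_name(run_id, data_type, lineage_hash):
    """The strax key doubles as the S3 bucket name: run id,
    data type and lineage hash joined by underscores."""
    return f'{run_id}_{data_type}_{lineage_hash}'


def chunk_object_key(chunk_i):
    """Objects inside a bucket are the zero-padded chunk
    filenames, matching the '%06d' format in _save_chunk."""
    return '%06d' % chunk_i
```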