
S3 bindings #68

Merged
merged 7 commits into master from s3 on Aug 21, 2018

Conversation

@tunnell (Member) commented Aug 21, 2018

We can now read from S3 storage. Note that you must have credentials to do this. It uses boto3; see the documentation in the PR for more.

The unique strax key (e.g. 170521_0011_raw_records_58340a130a541c95997fd1c442930427b04eac30) is the bucket name, and the objects within it are the chunk filenames. The layout should look very similar to the files I/O routines.
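For illustration, here is a minimal sketch of that layout in terms of boto3's client API (the bucket and object names follow the description above; the download call itself is just an example, not code from this PR):

import boto3

# Credentials come from the usual boto3 configuration (see the PR docs).
s3 = boto3.client('s3')

# The unique strax key doubles as the bucket name:
bucket = '170521_0011_raw_records_58340a130a541c95997fd1c442930427b04eac30'

# Each chunk is one object in that bucket, named by its chunk number:
object_name = '%06d' % 0  # first chunk: '000000'

s3.download_file(bucket, object_name, object_name)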

Are there no tests for I/O? Was thinking if I should mock this.

@tunnell (Member Author) commented Aug 21, 2018

Fixed a writing issue. Now done and ready for review. @JelleAalbers

@tunnell (Member Author) commented Aug 21, 2018

Though I guess we should wait for the automatic code review and Travis...

@JelleAalbers (Member) left a comment

Looks good!

There are indeed no generic I/O tests, though there are tests for DataDirectory + local file storage. It would be useful to generalize these at some point. It may be a bit tricky for S3, since you'd have to set up a mock S3 provider.
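One possible way to do that, sketched here with the third-party moto library (an assumption for illustration; it is not used in this PR), is to fake the S3 endpoints in-process:

import boto3
from moto import mock_s3

# moto's mock_s3 decorator intercepts boto3's AWS calls, so the test
# never talks to real S3 and needs no credentials.
@mock_s3
def test_s3_roundtrip():
    s3 = boto3.client('s3', region_name='us-east-1')
    s3.create_bucket(Bucket='some_strax_key')
    s3.put_object(Bucket='some_strax_key', Key='000000', Body=b'chunk data')
    body = s3.get_object(Bucket='some_strax_key', Key='000000')['Body'].read()
    assert body == b'chunk data'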

bk = self.backend_key(key_str)

# Get list of all buckets / (run_id, lineage)
objects_list = self.s3client.list_buckets()
@JelleAalbers (Member)

I guess this is fine for now, but is there no way to access a bucket by name? The list of all buckets can get rather large. Guess we don't need this once we have a runs db though.

@JelleAalbers (Member)

You use Bucket=self.key later in the Saver's close, so apparently it is possible.

@tunnell (Member Author)

As far as I know, this is the only way to determine with boto3's client API whether a bucket exists. I could go the exception route (try to access it, then catch the exception), but I try not to use exceptions for normal control flow.

@JelleAalbers (Member)

Exceptions in normal cases are pythonic (https://docs.quantifiedcode.com/python-anti-patterns/readability/asking_for_permission_instead_of_forgiveness_when_working_with_files.html). But in this case, at least according to https://stackoverflow.com/questions/26871884, it seems you get a generic ClientError, which you then have to laboriously parse to ensure you're not catching other errors.

Anyway, it's fine, the run db should take care of locating data in the end.
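For reference, a minimal sketch of that exception route, following the pattern from the linked Stack Overflow answer (head_bucket and the error-code check are illustrative, not code from this PR):

import botocore.exceptions

def bucket_exists(s3client, bucket_name):
    try:
        s3client.head_bucket(Bucket=bucket_name)
        return True
    except botocore.exceptions.ClientError as e:
        # 404 means the bucket does not exist; anything else (403, etc.)
        # is a different problem and should propagate.
        if e.response['Error']['Code'] == '404':
            return False
        raise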

def _save_chunk(self, data, chunk_info):
    filename = '%06d' % chunk_info['chunk_i']

    with tempfile.SpooledTemporaryFile() as f:
@JelleAalbers (Member)

This file-writing code (open a SpooledTemporaryFile, write to it, seek to 0, upload to S3) is repeated three times; maybe consider adding a new method?

@tunnell (Member Author)

I see what you mean, but it's hard to do: one is a download, one uploads JSON, and one uploads a chunk. So the only line that is the same is the seek (which may not be needed...).

@JelleAalbers (Member)

There's another JSON upload at the end of close. But if you think avoiding the duplication increases complexity too much, just leave it for now.

@tunnell (Member Author)

Good point; within the Saver I can do that.
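A minimal sketch of what such a Saver helper could look like (the helper name and the s3client attribute are assumptions for illustration, not necessarily what this PR merged; self.key follows the Bucket=self.key usage noted above):

import tempfile

class S3Saver:
    def _upload_data(self, data, object_name):
        """Write bytes to a spooled temp file, rewind, and upload to S3."""
        with tempfile.SpooledTemporaryFile() as f:
            f.write(data)
            f.seek(0)  # rewind so the upload reads from the start
            self.s3client.upload_fileobj(f, self.key, object_name)

    def _save_chunk(self, data, chunk_info):
        self._upload_data(data, '%06d' % chunk_info['chunk_i'])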

@tunnell added the enhancement (New feature or request) label Aug 21, 2018
@tunnell (Member Author) commented Aug 21, 2018

Once this goes into a release, I'll switch processing over to using it.

@tunnell merged commit bbf85f1 into master Aug 21, 2018
@tunnell deleted the s3 branch Aug 21, 2018 13:41