Add CLI command to sync data between two locations #188
Conversation
Pull Request Test Coverage Report for Build 16680188437 (Details)
💛 - Coveralls
Overall, looking real good.
Requesting changes for the question about the config property s3_bucket_sync_source and whether it should raise an exception if that env var is not set.
Additionally, I think we might benefit from a couple of error-situation tests in test_cli.py for this new sync command. A couple of possible ones:
- env var `S3_BUCKET_SYNC_SOURCE` is not set, but an explicit `source` is not provided
- there is nothing at the S3 source to sync
Number 2 above might be tricky... maybe not needed. To test that, that might suggest what's actually needed is some checking of the source before syncing to see if there's something actually there.
If we do think that's worthwhile, then a test when nothing is there would be helpful.
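The first error case above can be sketched as a self-contained test. This is not the real code from dsc/cli.py; `resolve_sync_source` is a hypothetical stand-in for the fallback logic the sync command is proposed to have:

```python
import os


def resolve_sync_source(explicit_source):
    """Stand-in for the proposed fallback: an explicit source wins,
    otherwise fall back to the S3_BUCKET_SYNC_SOURCE env var."""
    source = explicit_source or os.environ.get("S3_BUCKET_SYNC_SOURCE")
    if not source:
        raise OSError(
            "Env var 'S3_BUCKET_SYNC_SOURCE' must be set "
            "or an explicit source provided"
        )
    return source


# Error case 1: env var not set AND no explicit source provided.
os.environ.pop("S3_BUCKET_SYNC_SOURCE", None)
try:
    resolve_sync_source(None)
except OSError as exc:
    assert "S3_BUCKET_SYNC_SOURCE" in str(exc)
else:
    raise AssertionError("expected OSError when no source is configured")
```

In the real test_cli.py this check would run through Click's CliRunner against the actual sync command rather than a helper function.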
ehanson8 left a comment
Great stuff! A few comments, in addition to seconding @ghukill's comments.
dsc/cli.py
Outdated
@click.option(
    "--endpoint-url",
    help=(
        "Specifies a custom endpoint URL to which the AWS CLI sends S3 requests "
        "instead of the default AWS S3 service endpoint"
    ),
)
Is this mostly just for testing?
Yes, this question prompted me to determine whether setting AWS_ENDPOINT_URL as an environment variable is sufficient and I believe it is!
I've updated the sync CLI command and updated tests/test_cli.py::test_sync_success to set the AWS_ENDPOINT_URL env var using monkeypatch.
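For context, the real test uses pytest's monkeypatch fixture; the same env-var scoping can be sketched with only the standard library (`mock.patch.dict` restores the variable on exit, much as monkeypatch does at test teardown). The localhost URL is an assumed moto server address, not taken from the PR:

```python
import os
from unittest import mock

prior = os.environ.get("AWS_ENDPOINT_URL")

# Inside the context, the AWS CLI / boto3 would pick up the custom endpoint
# (e.g. a local moto server) from the environment, so no --endpoint-url flag
# needs to be threaded through the sync command.
with mock.patch.dict(os.environ, {"AWS_ENDPOINT_URL": "http://localhost:5000"}):
    endpoint = os.environ["AWS_ENDPOINT_URL"]

assert endpoint == "http://localhost:5000"
assert os.environ.get("AWS_ENDPOINT_URL") == prior  # restored afterwards
```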
If 'source' and 'destination' are not provided, the method will derive values
based on the required '--batch-id / -b' and '--workflow-name / -w' options and
S3 bucket env vars:
* source: batch path in S3_BUCKET_SYNC_SOURCE
* destination: batch path in S3_BUCKET_SUBMISSION_ASSETS
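A minimal sketch of that documented fallback. The `s3://<bucket>/<workflow>/<batch>` path layout is an assumption for illustration, not taken verbatim from the PR:

```python
import os


def derive_sync_paths(workflow_name, batch_id, source=None, destination=None):
    # Explicit values win; otherwise build batch paths from the env vars.
    if source is None:
        bucket = os.environ["S3_BUCKET_SYNC_SOURCE"]  # KeyError if unset
        source = f"s3://{bucket}/{workflow_name}/{batch_id}"
    if destination is None:
        bucket = os.environ["S3_BUCKET_SUBMISSION_ASSETS"]
        destination = f"s3://{bucket}/{workflow_name}/{batch_id}"
    return source, destination
```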
I wonder if this is unnecessary flexibility: is there actually a situation where we would do something other than workflow_name/batch_id?
While these options will not be used in production, I do think it can be handy for cases where we're running DSC locally or using a local file system, for testing or otherwise! What's important is that source and destination are configured as required=False.
dsc/cli.py
Outdated
optional_args = []
if dry_run:
    optional_args.append("--dryrun")
if endpoint_url:
    optional_args.extend(["--endpoint-url", endpoint_url])

args.extend(optional_args)
Couldn't these two args just be appended/extended to args directly, without the optional_args var?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point. I will update to use the args variable and add a comment to flag optional args instead.
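The refactor being agreed to might look like the sketch below (function name and signature are assumptions; the real command hands the resulting list to the AWS CLI via a subprocess call):

```python
def build_sync_args(source, destination, dry_run=False, endpoint_url=None):
    # Base 'aws s3 sync' command.
    args = ["aws", "s3", "sync", source, destination]
    # Optional args, appended directly instead of via an optional_args list.
    if dry_run:
        args.append("--dryrun")
    if endpoint_url:
        args.extend(["--endpoint-url", endpoint_url])
    return args
```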
ehanson8 left a comment
Approved! With one clarifying but optional comment
dsc/config.py
Outdated
if not value:
    raise OSError("Env var 'S3_BUCKET_SUBMISSION_ASSETS' must be defined")
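The distinction under discussion (required vs. optional bucket env vars) can be sketched as follows; the env var names match the PR, but the class shape is illustrative, not the actual dsc/config.py:

```python
import os


class Config:
    @property
    def s3_bucket_submission_assets(self):
        # Required: missing env var is a hard configuration error.
        value = os.environ.get("S3_BUCKET_SUBMISSION_ASSETS")
        if not value:
            raise OSError("Env var 'S3_BUCKET_SUBMISSION_ASSETS' must be defined")
        return value

    @property
    def s3_bucket_sync_source(self):
        # Optional: the sync command decides how to handle a missing value.
        return os.environ.get("S3_BUCKET_SYNC_SOURCE")
```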
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Why these changes are being introduced:

DSC supports submissions to both test and prod instances of MITL's DSpace repositories. Deposits to test are performed in the Dev and Stage environments, while deposits to prod are performed in the Prod environment. The required data must be present in, or uploaded to, the appropriate S3 buckets for each environment. This method provides an easy way to sync the data between two S3 buckets using the AWS CLI 'aws s3 sync' command.

How this addresses that need:
* Add 'sync' CLI command
* Add optional env var 'S3_BUCKET_SYNC_SOURCE'
* Update Dockerfile to include install of the AWS CLI (and its dependencies)
* Add 'sync' unit test with moto_server

Side effects of this change:
* In the case of DSC workflows, the command will be used to sync data from an S3 bucket to (an S3 bucket) in another AWS account. While the implemented 'sync' DSC CLI command is quite simple, this relies on configuring bucket policies and IAM Roles and policies to enable cross-account data transfer.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1352
Force-pushed: 6336db8 to 40e26be
Purpose and background context
DSC supports submissions to both test and prod instances of MITL's DSpace repositories. Deposits to test are performed in the Dev and Stage environments, while deposits to prod are performed in the Prod environment. The required data must be present in, or uploaded to, the appropriate S3 buckets for each environment. This method provides an easy way to sync the data between two S3 buckets using the AWS CLI 'aws s3 sync' command.
How can a reviewer manually see the effects of these changes?
Run `sync`; the execution fails because the `S3_BUCKET_SYNC_SOURCE` env var is not set. In Dev, the developer is expected to pass `"skip_sync": "true"` as part of the payload.

Includes new or updated dependencies?
YES - Installed Moto's `server` extras feature to run a stand-alone server for testing.

Changes expectations for external applications?
YES - In the case of DSC workflows, the command will be used to sync data from an S3 bucket to (an S3 bucket) in another AWS account. While the implemented 'sync' DSC CLI command is quite simple, this relies on configuring bucket policies and IAM Roles and policies to enable cross-account data transfer.
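As a rough illustration of the cross-account configuration this relies on, the destination bucket's policy must grant the syncing role write access (the account ID, role name, and bucket name below are placeholders; the role in the source account additionally needs s3:GetObject and s3:ListBucket on the source bucket):

```python
import json

# Hypothetical destination-bucket policy enabling cross-account 'aws s3 sync'.
destination_bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountSync",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111111111111:role/dsc-sync-role"},
            "Action": ["s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::destination-bucket",
                "arn:aws:s3:::destination-bucket/*",
            ],
        }
    ],
}
print(json.dumps(destination_bucket_policy, indent=2))
```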
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/IN-1352