Skip to content

Conversation

@jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented Jul 28, 2025

Purpose and background context

DSC supports submissions to both test and prod instances of MITL's DSpace repositories. Deposits to test are performed in the Dev and Stage environments, while deposits to prod are performed in the Prod environment. The required data must be present in, or uploaded to, the appropriate S3 buckets for each environment. This method provides an easy way to sync the data between two S3 buckets using the AWS CLI 'aws s3 sync' command.

How can a reviewer manually see the effects of these changes?

  1. Review the unit test.
  2. Review test executions of updated DSC step function

Includes new or updated dependencies?

YES - Installed Moto's extras feature server to run a stand-alone server for testing.

Changes expectations for external applications?

YES - In the case of DSC workflows, the command will be used to sync data from an S3 bucket to (an S3 bucket) in another AWS account. While the implemented 'sync' DSC CLI command is quite simple, this relies on configuring bucket policies and IAM Roles and policies to enable cross-account data transfer.

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@coveralls
Copy link

coveralls commented Jul 28, 2025

Pull Request Test Coverage Report for Build 16680188437

Details

  • 34 of 39 (87.18%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 96.872%

Changes Missing Coverage Covered Lines Changed/Added Lines %
dsc/cli.py 30 35 85.71%
Totals Coverage Status
Change from base Build 16629981396: -0.3%
Covered Lines: 960
Relevant Lines: 991

💛 - Coveralls

@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review July 29, 2025 14:02
@jonavellecuerdo jonavellecuerdo requested a review from a team as a code owner July 29, 2025 14:02
@ehanson8 ehanson8 self-assigned this Jul 29, 2025
Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looking real good.

Requesting changes for the question about the config property s3_bucket_sync_source and whether it should raise an exception if that env var is not set.

Additionally, I think we might benefit from a couple error situation tests in test_cli.py for this new sync command? A couple possible ones:

  1. env var S3_BUCKET_SYNC_SOURCE is not set, but explicit source is not provided
  2. there is nothing at the S3 source to sync

Number 2 above might be tricky... maybe not needed. To test that, that might suggest what's actually needed is some checking of the source before syncing to see if there's something actually there.

If we do think that's worthwhile, then a test when nothing is there would be helpful.

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff! A few comments in addition to seconding @ghukill 's comments

dsc/cli.py Outdated
Comment on lines 164 to 170
@click.option(
"--endpoint-url",
help=(
"Specifies a custom endpoint URL to which which the AWS CLI sends S3 requests "
"instead of the default AWS S3 service endpoint"
),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this mostly just for testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this question prompted me to determine whether setting AWS_ENDPOINT_URL as an environment variable is sufficient and I believe it is!

I've updated the sync CLI command and updated tests/test_cli.py::test_sync_success to set the AWS_ENDPOINT_URL env var using monkeypatch.

Comment on lines +189 to +185
If 'source' and 'destination' are not provided, the method will derive values
based on the required '--batch-id / -b' and 'workflow-name / -w' options and
S3 bucket env vars:
* source: batch path in S3_BUCKET_SYNC_SOURCE
* destination: batch path in S3_BUCKET_SUBMISSION_ASSETS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is unnecessary flexibility, is there actually a situation where we would do something other than workflow_name/batch_id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While these options will not be used in production, I do think it can be handy for cases where we're running DSC locally or using a local file system--for testing or otherwise! What's important is that source and destination are configured as required=False.

dsc/cli.py Outdated
Comment on lines 224 to 230
optional_args = []
if dry_run:
optional_args.append("--dryrun")
if endpoint_url:
optional_args.extend(["--endpoint-url", endpoint_url])

args.extend(optional_args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't these 2 args just be appended/extended to args directly without the optional_args var?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. I will update to use the args variable and add a comment to flag optional args instead.

jonavellecuerdo added a commit that referenced this pull request Jul 31, 2025
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved! With one clarifying but optional comment

dsc/config.py Outdated
Comment on lines 55 to 56
if not value:
raise OSError("Env var 'S3_BUCKET_SUBMISSION_ASSETS' must be defined")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jonavellecuerdo
Copy link
Contributor Author

@ghukill @ehanson8 I actually ended up back-tracking the change I made to Config.s3_bucket_sync_source. Please take a look at the latest commit and let me know what you think.

Why these changes are being introduced:
DSC supports submissions to both test and prod instances of MITL's
DSpace repositories. Deposits to test are performed in the Dev and
Stage environments, while deposits to prod are performed in the
Prod environment. The required data must be present in, or uploaded to,
the appropriate S3 buckets for each environment. This method
provides an easy way to sync the data between two S3 buckets
using the AWS CLI 'aws s3 sync' command.

How this addresses that need:
* Add 'sync' CLI command
* Add optional env var 'S3_BUCKET_SYNC_SOURCE'
* Update Dockerfile to include install of AWS CLI (and its dependencies)
* Add 'sync' unit test with moto_server

Side effects of this change:
* In the case of DSC workflows, the command will be used to sync data
from an S3 bucket to (an S3 bucket) in another AWS account. While the
implemented 'sync' DSC CLI command is quite simple, this relies on
configuring bucket policies and IAM Roles and policies to
enable cross-account data transfer.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1352
@jonavellecuerdo jonavellecuerdo force-pushed the IN-1352-add-sync-cli-command branch from 6336db8 to 40e26be Compare August 1, 2025 16:29
@jonavellecuerdo jonavellecuerdo merged commit f650c4a into main Aug 1, 2025
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the IN-1352-add-sync-cli-command branch August 1, 2025 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants