Add CLI command to sync data between two locations #188

jonavellecuerdo · 2025-07-28T18:12:59Z

Purpose and background context

DSC supports submissions to both test and prod instances of MITL's DSpace repositories. Deposits to test are performed in the Dev and Stage environments, while deposits to prod are performed in the Prod environment. The required data must be present in, or uploaded to, the appropriate S3 buckets for each environment. This method provides an easy way to sync the data between two S3 buckets using the AWS CLI 'aws s3 sync' command.

How can a reviewer manually see the effects of these changes?

- Review the unit test.
- Review test executions of the updated DSC step function:
  - Created a separate step function for testing: jc-test-dsc-with-sync-dev
  - Ran two executions:
    - 3bddc747-9a05-44fe-a352-3621350b36fc: Attempts to run sync; the execution fails because the S3_BUCKET_SYNC_SOURCE env var is not set. In Dev, the developer is expected to pass "skip_sync": "true" as part of the payload.
    - ba31df59-f004-43ee-ae87-6d7474fde98f: Skips running sync; the execution is successful.
  - There will be a separate ticket, IN-1353, where we can discuss the proposed updates to the step function in detail.

Includes new or updated dependencies?

YES - Installed Moto with its 'server' extra (moto[server]) to run a stand-alone server for testing.

Changes expectations for external applications?

YES - In the case of DSC workflows, the command will be used to sync data from an S3 bucket to an S3 bucket in another AWS account. While the implemented 'sync' DSC CLI command is quite simple, it relies on bucket policies and IAM roles and policies being configured to enable cross-account data transfer.

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/IN-1352

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

coveralls · 2025-07-28T18:15:14Z

Pull Request Test Coverage Report for Build 16680188437

Details

34 of 39 (87.18%) changed or added relevant lines in 2 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.3%) to 96.872%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| dsc/cli.py | 30 | 35 | 85.71% |

Totals
Change from base Build 16629981396:	-0.3%
Covered Lines:	960
Relevant Lines:	991

💛 - Coveralls

ghukill

Overall, looking real good.

Requesting changes for the question about the config property s3_bucket_sync_source and whether it should raise an exception if that env var is not set.

Additionally, I think we might benefit from a couple of error-situation tests in test_cli.py for this new sync command. A couple of possible ones:

1. env var S3_BUCKET_SYNC_SOURCE is not set and an explicit source is not provided
2. there is nothing at the S3 source to sync

Number 2 above might be tricky... maybe not needed. Testing it might suggest that what's actually needed is some checking of the source before syncing, to see if there's something actually there.

If we do think that's worthwhile, then a test when nothing is there would be helpful.
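A pre-sync check along the lines suggested above could be sketched as follows. This is an illustration only, not code from this PR: `source_has_objects` and `StubS3Client` are hypothetical names, and the function assumes a boto3-style S3 client whose `list_objects_v2` response includes `KeyCount`.

```python
from typing import Any, Dict, List


def source_has_objects(s3_client: Any, bucket: str, prefix: str) -> bool:
    """Return True if at least one object exists under the given prefix."""
    # MaxKeys=1 keeps the request small; only existence matters here.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


class StubS3Client:
    """Hypothetical stand-in for a boto3 S3 client, used only for this demo."""

    def __init__(self, keys: List[str]) -> None:
        self.keys = keys

    def list_objects_v2(self, Bucket: str, Prefix: str, MaxKeys: int) -> Dict[str, int]:
        # Mimic only the part of the response that the check reads.
        matches = [key for key in self.keys if key.startswith(Prefix)][:MaxKeys]
        return {"KeyCount": len(matches)}
```

A test for the "nothing at the source" case could then assert that the check returns False for an empty prefix before any sync is attempted.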


ehanson8

Great stuff! A few comments, in addition to seconding @ghukill's comments.

ehanson8 · 2025-07-29T18:43:29Z

dsc/cli.py

+@click.option(
+    "--endpoint-url",
+    help=(
+        "Specifies a custom endpoint URL to which the AWS CLI sends S3 requests "
+        "instead of the default AWS S3 service endpoint"
+    ),
+)


Is this mostly just for testing?

jonavellecuerdo · 2025-07-31

Yes, this question prompted me to determine whether setting AWS_ENDPOINT_URL as an environment variable is sufficient, and I believe it is!

I've updated the sync CLI command and updated tests/test_cli.py::test_sync_success to set the AWS_ENDPOINT_URL env var using monkeypatch.
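The reason an environment variable can replace the flag is that a subprocess inherits the parent's environment by default. A minimal sketch of that behaviour, using only the standard library: `http://localhost:5000` is a hypothetical moto server address, `mock.patch.dict` stands in for the `monkeypatch` fixture mentioned above, and the child process here is a plain Python interpreter rather than the AWS CLI. Whether the AWS CLI honours `AWS_ENDPOINT_URL` depends on the installed CLI version.

```python
import os
import subprocess
import sys
from unittest import mock

MOTO_SERVER_URL = "http://localhost:5000"  # hypothetical test server address


def endpoint_seen_by_child() -> str:
    """Run a child process and report the AWS_ENDPOINT_URL value it sees."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('AWS_ENDPOINT_URL', ''))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo() -> str:
    # Temporarily set the env var, as a test fixture would, then spawn the child.
    with mock.patch.dict(os.environ, {"AWS_ENDPOINT_URL": MOTO_SERVER_URL}):
        return endpoint_seen_by_child()
```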

ehanson8 · 2025-07-29T18:47:15Z

dsc/cli.py

+    If 'source' and 'destination' are not provided, the method will derive values
+    based on the required '--batch-id / -b' and '--workflow-name / -w' options and
+    S3 bucket env vars:
+        * source: batch path in S3_BUCKET_SYNC_SOURCE
+        * destination: batch path in S3_BUCKET_SUBMISSION_ASSETS


I wonder if this is unnecessary flexibility. Is there actually a situation where we would do something other than workflow_name/batch_id?

jonavellecuerdo · 2025-07-31

While these options will not be used in production, I do think they can be handy for cases where we're running DSC locally or using a local file system, for testing or otherwise! What's important is that source and destination are configured as required=False.
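The fallback described in the docstring could be sketched roughly as below. This is not the PR's implementation: `resolve_sync_locations` is a hypothetical name, the `workflow_name/batch_id/` path layout is assumed from the discussion above, and `ValueError` is a stand-in for the click usage error that a later commit in this PR raises when required configs are not set.

```python
import os
from typing import Optional, Tuple


def resolve_sync_locations(
    workflow_name: str,
    batch_id: str,
    source: Optional[str] = None,
    destination: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (source, destination), deriving any value that was not passed."""
    batch_path = f"{workflow_name}/{batch_id}/"  # assumed batch path layout

    if source is None:
        source_bucket = os.getenv("S3_BUCKET_SYNC_SOURCE")
        if not source_bucket:
            # Stand-in for the usage error raised by the CLI.
            raise ValueError("Set S3_BUCKET_SYNC_SOURCE or pass an explicit source")
        source = f"s3://{source_bucket}/{batch_path}"

    if destination is None:
        destination_bucket = os.getenv("S3_BUCKET_SUBMISSION_ASSETS")
        if not destination_bucket:
            raise ValueError(
                "Set S3_BUCKET_SUBMISSION_ASSETS or pass an explicit destination"
            )
        destination = f"s3://{destination_bucket}/{batch_path}"

    return source, destination
```

Explicit values always win, which is what keeps local file system paths usable for testing.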

ehanson8 · 2025-07-29T18:49:29Z

dsc/cli.py

+    optional_args = []
+    if dry_run:
+        optional_args.append("--dryrun")
+    if endpoint_url:
+        optional_args.extend(["--endpoint-url", endpoint_url])
+
+    args.extend(optional_args)


Couldn't these 2 args just be appended/extended to args directly without the optional_args var?

jonavellecuerdo · 2025-07-31

That is a good point. I will update to use the args variable and add a comment to flag optional args instead.
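The simplification agreed on here might look like the sketch below. `build_sync_args` is a hypothetical helper and the base command list is an assumption, since the quoted diff does not show how `args` is first built; `--dryrun` and `--endpoint-url` are the flags from the snippet under review.

```python
from typing import List, Optional


def build_sync_args(
    source: str,
    destination: str,
    dry_run: bool = False,
    endpoint_url: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for an 'aws s3 sync' call."""
    args = ["aws", "s3", "sync", source, destination]  # assumed base command

    # optional args
    if dry_run:
        args.append("--dryrun")
    if endpoint_url:
        args.extend(["--endpoint-url", endpoint_url])

    return args
```

Keeping a single list avoids the intermediate variable while the comment still marks where the optional flags begin.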


ehanson8

Approved! With one clarifying but optional comment


ghukill · 2025-08-01T15:15:45Z

dsc/config.py

+        if not value:
+            raise OSError("Env var 'S3_BUCKET_SUBMISSION_ASSETS' must be defined")


jonavellecuerdo · 2025-08-01T15:26:27Z

@ghukill @ehanson8 I actually ended up back-tracking the change I made to Config.s3_bucket_sync_source. Please take a look at the latest commit and let me know what you think.
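For readers following the back-and-forth, the difference between a required and an optional bucket setting might be sketched as below. The `OSError` check mirrors the dsc/config.py diff quoted earlier in this thread; returning `None` for `S3_BUCKET_SYNC_SOURCE` is an assumption based on the commit message describing that env var as optional, and the class shown is a simplified stand-in rather than the actual `Config`.

```python
import os
from typing import Optional


class Config:
    """Simplified stand-in for the DSC config class."""

    @property
    def s3_bucket_submission_assets(self) -> str:
        # Required: fail as soon as the property is accessed without a value.
        value = os.getenv("S3_BUCKET_SUBMISSION_ASSETS")
        if not value:
            raise OSError("Env var 'S3_BUCKET_SUBMISSION_ASSETS' must be defined")
        return value

    @property
    def s3_bucket_sync_source(self) -> Optional[str]:
        # Optional (assumed): the caller decides what a missing value means.
        return os.getenv("S3_BUCKET_SYNC_SOURCE")
```

With this split, only the sync command needs to complain when the source bucket is missing, and other commands are unaffected.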


Why these changes are being introduced:
DSC supports submissions to both test and prod instances of MITL's DSpace repositories. Deposits to test are performed in the Dev and Stage environments, while deposits to prod are performed in the Prod environment. The required data must be present in, or uploaded to, the appropriate S3 buckets for each environment. This method provides an easy way to sync the data between two S3 buckets using the AWS CLI 'aws s3 sync' command.

How this addresses that need:
* Add 'sync' CLI command
* Add optional env var 'S3_BUCKET_SYNC_SOURCE'
* Update Dockerfile to include install of AWS CLI (and its dependencies)
* Add 'sync' unit test with moto_server

Side effects of this change:
* In the case of DSC workflows, the command will be used to sync data from an S3 bucket to (an S3 bucket) in another AWS account. While the implemented 'sync' DSC CLI command is quite simple, this relies on configuring bucket policies and IAM Roles and policies to enable cross-account data transfer.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1352

jonavellecuerdo marked this pull request as ready for review July 29, 2025 14:02

jonavellecuerdo requested a review from a team as a code owner July 29, 2025 14:02

ehanson8 self-assigned this Jul 29, 2025

ghukill requested changes Jul 29, 2025


ehanson8 reviewed Jul 29, 2025


jonavellecuerdo added a commit that referenced this pull request Jul 31, 2025

Address comments in PR #188

a8adae3

jonavellecuerdo requested review from ehanson8 and ghukill July 31, 2025 19:57

ehanson8 approved these changes Aug 1, 2025


ghukill approved these changes Aug 1, 2025


👍

jonavellecuerdo requested review from ehanson8 and ghukill August 1, 2025 15:26

ghukill approved these changes Aug 1, 2025


ehanson8 approved these changes Aug 1, 2025


jonavellecuerdo added 3 commits August 1, 2025 12:29

Address comments in PR #188

f5dfb80

Raise click.Usage error when required configs aren't set for sync

40e26be

jonavellecuerdo force-pushed the IN-1352-add-sync-cli-command branch from 6336db8 to 40e26be Compare August 1, 2025 16:29

jonavellecuerdo merged commit f650c4a into main Aug 1, 2025
3 checks passed

jonavellecuerdo deleted the IN-1352-add-sync-cli-command branch August 1, 2025 16:33
