Conversation

@jonavellecuerdo jonavellecuerdo commented Oct 7, 2025

Purpose and background context

A CLI command is required to support the new batch creation step for DSC. These changes also connect the batch creation step with the syncing process:

  • Workflow.create_batch can skip data retrieval (e.g. API requests) when --sync-data is provided in the command args
  • The create CLI command can invoke the 'sync' command when --sync-data is provided in the command args

A couple thoughts:

  • Syncing defaults to False.
    • For Dev and Stage, syncing is not expected.
    • For Prod, syncing the batch data from the S3 bucket in Stage is expected.
  • While a workflow like Wiley could very well check if data exists in the S3 bucket before making the API request, passing the synced boolean to the batch creation methods allows DSC to skip unnecessary reads to S3 altogether.
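To make the second point concrete, here is a minimal sketch of how a workflow might act on the flag. Only create_batch and the synced keyword come from this PR; the class name ExampleWorkflow, the _retrieve_data helper, and the string return values are hypothetical stand-ins, not DSC's actual code.

```python
class ExampleWorkflow:
    """Hypothetical workflow; not the real DSC Workflow class."""

    def __init__(self) -> None:
        self.retrieval_calls = 0

    def _retrieve_data(self) -> None:
        # Stand-in for an expensive retrieval step such as an API request.
        self.retrieval_calls += 1

    def create_batch(self, *, synced: bool = False) -> str:
        if synced:
            # The sync step already copied the data into place, so neither
            # a retrieval nor an "is it already there?" read is needed.
            return "used synced data"
        self._retrieve_data()
        return "retrieved data"


workflow = ExampleWorkflow()
print(workflow.create_batch())             # retrieval happens
print(workflow.create_batch(synced=True))  # retrieval skipped
print(workflow.retrieval_calls)
```

The caller decides up front whether the data is already in place, so the workflow never has to probe storage to find out.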

How can a reviewer manually see the effects of these changes?

Review of added unit tests is sufficient for this PR

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES - This will require changes to the DSC step function. By enabling the create CLI command to invoke the sync CLI command, I expect this will allow us to simplify the logic of our step function and consolidate logs for both commands in a single CloudWatch logstream. For context, the latest version of the step function we were proposing is outlined in jc-test-dsc-with-sync-dev. Ticket IN-1457 will address updates to the step function.

What are the relevant tickets?

Why these changes are being introduced:
* A CLI command is required to support the new batch creation
  step for DSC. These changes also connect the batch creation
  step with the syncing process:
  * Workflow.create_batch can skip data retrieval (e.g. API requests)
    when '--sync-data' is provided in the command args
  * The 'create' CLI command can invoke the 'sync' command when
    '--sync-data' is provided in the command args

How this addresses that need:
* Add 'create' CLI command
* Add 'synced' boolean keyword argument to batch creation methods
* Add unit tests

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1456
def prepare_batch(
    self,
    *,
    synced: bool = False,  # noqa: ARG002
Contributor Author

For workflows like Wiley, the synced boolean can be used to skip data retrieval steps if the data has been synced (e.g., in Prod, the batch "folder" and its contents will be pulled from Stage).

Also, note that while SimpleCSV does not use the synced argument, its name does not lead with an underscore. This is because mypy will raise an error about the mismatched parameter names between the @abstractmethod and the implementation, and it's suspected that this is because the parameter is a keyword argument. 🤔 Ruff error [ARG002](https://docs.astral.sh/ruff/rules/unused-method-argument/) will need to be skipped as a result.

TLDR: Either mypy or Ruff would have an issue. Decided to ignore Ruff's error.
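A minimal sketch of the trade-off, assuming the suspicion above is right that the keyword-only parameter is what matters. The class names BaseWorkflow, CsvLikeWorkflow, and RenamedWorkflow and the return values are hypothetical; only prepare_batch, synced, and the noqa comment come from the diff. Because the parameter can only be passed by name, renaming it in an override changes what callers are allowed to write, which is visible at runtime:

```python
from abc import ABC, abstractmethod


class BaseWorkflow(ABC):
    """Hypothetical base class standing in for the abstract workflow."""

    @abstractmethod
    def prepare_batch(self, *, synced: bool = False) -> list[str]:
        """Return the items in the batch."""


class CsvLikeWorkflow(BaseWorkflow):
    """Hypothetical subclass that has no use for ``synced``."""

    def prepare_batch(
        self,
        *,
        synced: bool = False,  # noqa: ARG002
    ) -> list[str]:
        # The name stays ``synced`` so keyword callers keep working.
        return ["item-1", "item-2"]


class RenamedWorkflow(BaseWorkflow):
    """Hypothetical subclass that renames the unused parameter."""

    def prepare_batch(self, *, _synced: bool = False) -> list[str]:
        return []


print(CsvLikeWorkflow().prepare_batch(synced=True))

try:
    RenamedWorkflow().prepare_batch(synced=True)
except TypeError as exc:
    print(f"renamed parameter breaks callers: {exc}")
```

Keeping the name and silencing the linter leaves the interface identical across workflows, which matches the decision described here.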

@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review October 7, 2025 17:37
@jonavellecuerdo jonavellecuerdo requested a review from a team as a code owner October 7, 2025 17:37
@ehanson8 ehanson8 left a comment

A few comments but looking good!

dsc/cli.py Outdated
Comment on lines 94 to 117
@click.option(
    "-s",
    "--source",
    help=(
        "Source directory formatted as a local filesystem path or "
        "an S3 URI in s3://bucket/prefix form"
    ),
)
@click.option(
    "-d",
    "--destination",
    help=(
        "Destination directory formatted as a local filesystem path or "
        "an S3 URI in s3://bucket/prefix form"
    ),
)
@click.option(
    "--dry-run",
    is_flag=True,
    help=(
        "Display the operations that would be performed using the "
        "sync command without actually running them"
    ),
)
Contributor

Perhaps these should be prepended with sync_ for greater clarity?

Agreed with @ehanson8 here (or sync- if we're going with dashes in CLI args).

@ehanson8 ehanson8 Oct 8, 2025

Yes, sync-!

Contributor Author

I updated the signature of the create CLI command as follows:

@main.command()
@click.pass_context
@click.option("--sync-data/--no-sync-data", default=False)
@click.option(
    "--sync-dry-run",
    is_flag=True,
    help=(
        "Display the operations that would be performed using the "
        "sync command without actually running them"
    ),
)
@click.option(
    "-s",
    "--sync-source",
    help=(
        "Source directory formatted as a local filesystem path or "
        "an S3 URI in s3://bucket/prefix form"
    ),
)
@click.option(
    "-d",
    "--sync-destination",
    help=(
        "Destination directory formatted as a local filesystem path or "
        "an S3 URI in s3://bucket/prefix form"
    ),
)
@click.option(
    "-e",
    "--email-recipients",
    help="The recipients of the batch creation results email as a comma-delimited string",
    default=None,
)
def create(
    ctx: click.Context,
    *,
    sync_data: bool = False,
    sync_dry_run: bool = False,
    sync_source: str | None = None,
    sync_destination: str | None = None,
    email_recipients: str | None = None,
) -> None:

Note: Moving the asterisk (*) to directly after ctx: click.Context lets the args be arranged in a more logical order. It does mean that if create is ever called directly rather than through the CLI (e.g., outside of the pipenv command), the caller must provide all params as keyword arguments, which doesn't seem like a bad call given the number of params anyway.
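For anyone less familiar with the bare asterisk, here is a minimal sketch of the behavior described in the note. create_stub is a hypothetical, click-free stand-in with a trimmed parameter list, and the S3 URI is a placeholder; it is not the real command.

```python
from typing import Optional


def create_stub(
    ctx: object,
    *,
    sync_data: bool = False,
    sync_source: Optional[str] = None,
    sync_destination: Optional[str] = None,
) -> dict[str, object]:
    """Hypothetical stand-in for the CLI callback; echoes its arguments."""
    return {
        "sync_data": sync_data,
        "sync_source": sync_source,
        "sync_destination": sync_destination,
    }


# Keyword arguments can be supplied in any order.
print(create_stub(None, sync_destination="s3://bucket/prefix", sync_data=True))

# Anything after ctx that is passed positionally is rejected.
try:
    create_stub(None, True)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Since everything after ctx is keyword-only, the order of the parameters in the signature is free to follow whatever reads best.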

@ghukill ghukill left a comment

A great start. I left a couple of higher level questions, and can do a secondary pass after discussions there.

dsc/cli.py Outdated
Comment on lines 136 to 137
if sync_data:
    ctx.invoke(sync, source, destination, dry_run=dry_run)

Have we tried this locally? I'm a little concerned that the sys.exit(return_code) line in the sync CLI command may terminate the parent Python process, so execution would never make it past this step.

Additionally, and related: what happens if the sync subcommand has an error? How do we know, and how is it handled?
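A small sketch of the mechanics behind both questions, using plain functions rather than click. sys.exit() works by raising SystemExit, so when one function calls another in the same process, the exception propagates to the caller and ends the interpreter unless something catches it. sync_stub and create_stub are hypothetical stand-ins, and the handling shown is only one possible approach, not necessarily what this PR ended up doing.

```python
import sys


def sync_stub(return_code: int) -> None:
    """Hypothetical stand-in for a command that ends with sys.exit()."""
    sys.exit(return_code)


def create_stub(return_code: int) -> str:
    """Hypothetical caller that runs the sync step, then keeps going."""
    try:
        sync_stub(return_code)
    except SystemExit as exc:
        # A zero (or missing) exit code means the sync step succeeded.
        if exc.code not in (0, None):
            raise RuntimeError(
                f"sync failed with exit code {exc.code}"
            ) from exc
    return "batch created"


print(create_stub(0))

try:
    create_stub(2)
except RuntimeError as exc:
    print(exc)
```

Without the try/except, even a successful sync_stub(0) would stop create_stub before it returned, which is the behavior the first question is worried about.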

Contributor Author

(Tagging @ehanson8 for awareness)

This was a great question and led to the following changes:

Loving it! Thanks for the deep dive on this and the explanation.

@coveralls

Pull Request Test Coverage Report for Build 18502412282

Details

  • 33 of 34 (97.06%) changed or added relevant lines in 4 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.2%) to 96.015%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| dsc/cli.py | 28 | 29 | 96.55% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| dsc/cli.py | 1 | 88.89% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 18285846599: | 0.2% |
| Covered Lines: | 1036 |
| Relevant Lines: | 1079 |

💛 - Coveralls

@ehanson8 ehanson8 left a comment

Looks great to me, thanks for the changes!


if sync_data:
    ctx.invoke(sync, source, destination, dry_run=dry_run)
    try:
Contributor

Awesome!

@ghukill ghukill left a comment

Looks good to me!

@jonavellecuerdo jonavellecuerdo merged commit 0f3549a into main Oct 14, 2025
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the IN-1456-add-create-cli-command branch October 15, 2025 20:37