
Conversation

@ehanson8 ehanson8 (Contributor) commented Jan 6, 2025

Purpose and background context

Add workflow functionality in addition to a reconcile CLI command. Reviewing each commit separately would be ideal.

The reconcile command would be run manually before each DSC submission batch to confirm that the metadata and bitstreams are properly aligned: there are no bitstreams without metadata and no metadata without at least one bitstream. Once that is confirmed, the soon-to-be-written deposit CLI command would be run for the batch.

How can a reviewer manually see the effects of these changes?

Not possible yet

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

* Refactor BaseWorkflow to move init args to class attributes
* Add load and get_workflow class methods to BaseWorkflow class
* Add metadata_mapping property to BaseWorkflow class
* Add click options to main CLI command
* Add load method call and other basic functionality to main CLI command
* Add post_main_group_subcommand to cli.py
* Add TestBaseWorkflow class and TestSimpleCSV class to conftest.py
Why these changes are being introduced:
* A CLI command is needed to reconcile metadata against files in the specified S3 batch path.

How this addresses that need:
* Add reconcile CLI command and corresponding CLI tests
* Add build_bitstream_dict, match_bitstreams_to_item_identifiers, and
match_item_identifiers_to_bitstreams functions as well as corresponding unit tests
* Add test_metadata_mapping.json to fixtures directory

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1128
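Since the commit list references `load` and `get_workflow` class methods (later refactored to use subclasses), here is a minimal sketch of how that subclass-lookup pattern might work. The `SimpleCSV` class and the error message text below are illustrative assumptions, not the project's actual code:

```python
from abc import ABC


class InvalidWorkflowNameError(Exception):
    """Raised when no workflow subclass matches the requested name."""


class BaseWorkflow(ABC):
    """A base workflow class from which other workflow classes are derived."""

    workflow_name: str = "base"

    @classmethod
    def get_workflow(cls, workflow_name: str) -> type["BaseWorkflow"]:
        """Return the subclass whose workflow_name matches the CLI argument."""
        for workflow_class in cls.__subclasses__():
            if workflow_class.workflow_name == workflow_name:
                return workflow_class
        raise InvalidWorkflowNameError(f"Invalid workflow name: {workflow_name}")

    @classmethod
    def load(cls, workflow_name: str) -> "BaseWorkflow":
        """Resolve a subclass by name and return an instance of it."""
        return cls.get_workflow(workflow_name)()


class SimpleCSV(BaseWorkflow):
    """Hypothetical concrete workflow used only for illustration."""

    workflow_name: str = "simple_csv"
```

One benefit of a `__subclasses__()` lookup like this is that no separate registry constant is needed: defining the subclass is what registers the workflow.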
@ghukill ghukill left a comment

I'm understanding how this application is taking shape, and continue to think the overall pattern is a great one (i.e. workflows).

Leaving a "Comment" review at this time with somewhat high level questions.

@jonavellecuerdo jonavellecuerdo (Contributor) left a comment

@ehanson8 I think this is a great start! As I reviewed the updates in this PR, it raised questions about environment variables, and I wondered if we could have a team discussion about the workflow implementation in more detail (see #56 (comment))?

Comment on lines +27 to +32
workflow_name: str = "base"
submission_system: str = "DSpace@MIT"
email_recipients: tuple[str] = ("None",)
metadata_mapping_path: str = ""
s3_bucket: str = ""
output_queue: str = ""

Our recent discussions around how the workflows will be initialized made me wonder how these values are set for workflow instances.

Recalling what was written in the DSC engineering plan:

The application will be deployed as an ECS task that is executed via CLI commands using the boto3 library.


Expanding the instructions in the Example Application Run, these are the steps I figured I'd need to do:

Prerequisite

  1. Open a terminal, navigate to my local clone of DSC, and activate my virtual environment.

Running reconcile command

  1. Set up a .env file at the root of the DSC app directory. Looking at the current definition of the reconcile command, it seems like it needs to know the following values for a given workflow: [s3_bucket, batch_path]. The .env file should look like:
    DSC_S3_BUCKET="dsc"
  2. Run reconcile command in terminal:
    dsc -w <workflow-name> -b <batch_id> -c <collection-handle> reconcile
    

Running deposit command

Option A: Running deposit via the DSC CLI

  1. Set up a .env file at the root of the DSC app directory.
    DSC_S3_BUCKET="dsc"
    DSC_SUBMISSION_SYSTEM="DSpace@MIT"
    DSC_OUTPUT_QUEUE="output-queue"
  2. Run deposit command in terminal
    dsc -w <workflow-name> -b <batch-id> -c <collection-handle> deposit 
    

Option B: Running deposit ECS task using AWS CLI

  1. Set up a JSON file with overrides for the ECS task (e.g., overrides.json)

    {
     "containerOverrides": [
         {
             "name": "dsc-ecs-<env>",
             "command": [
                 "-w",
                 "<workflow-name>",
                 "-b",
                 "<batch-id>",
                 "-c",
                 "<collection-handle>",
                 "deposit"
             ],
             "environment": [
                 {
                     "name": "DSC_S3_BUCKET",
                     "value": "dsc"
                 },
                 {
                     "name": "DSC_SUBMISSION_SYSTEM",
                     "value": "DSpace@MIT"
                 },
                 {
                     "name": "DSC_OUTPUT_QUEUE",
                     "value": "<output-queue-name>"
                 }
             ]
         }
     ]
    }
  2. Run aws ecs run-task CLI command in terminal (see Carbon Makefile for example).

    aws ecs run-task --cluster dsc-ecs-<env> --task-definition dsc-ecs-<env>-workflow-deposit --launch-type="FARGATE" --region us-east-1 --network-configuration '{"awsvpcConfiguration": {"subnets": [<subnets>], "securityGroups": [<security-groups>],"assignPublicIp": "DISABLED"}}' --overrides file://overrides.json
    

Option C: Running deposit ECS task using ??? DSC CLI command using the boto3 library.
Note: This is how I interpreted what is currently written in the Engineering Plan. 🤔 I believe the idea was from our recent work with the Alma SAP Invoices UI where we defined an ECSClient to run ECS tasks.

This work needs to be scoped out some more in terms of passing workflow parameter values, but it should support accessing workflow parameter values set in a .env via the Config.


Takeaways

  • Environment variable names subject to change (curious what folks would find helpful)🤓

  • Change request: Include additional variables related to reconcile, deposit, and finalize DSC CLI commands to REQUIRED_ENV_VARS and OPTIONAL_ENV_VARS in dsc.Config.

  • Request: Have a discussion as a team to scope out workflow implementation~

    • Create a ticket to scope out a CLI command that runs an ECS task for the DSC deposit CLI command.
    • Create a ticket for defining an ECSClient
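For reference, an ECSClient along the lines discussed could be sketched as below. Every name here (functions, class, cluster/task values) is a hypothetical assumption; the kwargs simply mirror the aws ecs run-task invocation in Option B, and boto3's `run_task` accepts them as keyword arguments:

```python
def build_run_task_kwargs(
    cluster: str,
    task_definition: str,
    container_name: str,
    command: list[str],
    environment: dict[str, str],
    network_configuration: dict,
) -> dict:
    """Assemble run_task kwargs mirroring the overrides.json example above."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "networkConfiguration": network_configuration,
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    "command": command,
                    "environment": [
                        {"name": name, "value": value}
                        for name, value in environment.items()
                    ],
                }
            ]
        },
    }


class ECSClient:
    """Hypothetical thin wrapper around boto3's ECS client."""

    def __init__(self, region: str = "us-east-1"):
        import boto3  # deferred import; requires AWS credentials at runtime

        self.client = boto3.client("ecs", region_name=region)

    def run(self, **run_task_kwargs):
        return self.client.run_task(**run_task_kwargs)
```

Separating the pure kwargs builder from the boto3 call keeps the override-assembly logic testable without AWS access.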

Also, I think this all started with some questions about how the workflow attributes retrieve values.

  • I think it makes sense for workflow_name and metadata_mapping to be hardcoded in the class definition.
  • To support flexibility, I think submission_system, email_recipients, and s3_bucket should be accessed as environment variables via the Config. 🤔


As noted by @ehanson8 in Slack, the questions/comments above do not need to be addressed via this PR, but they have prompted a discussion about workflow implementation to be addressed in future PRs!


We synced up on env vars and arrived at the following decision:

  • Explicitly set class attributes in the workflow module:
class BaseWorkflow(ABC):
    """A base workflow class from which other workflow classes are derived."""

    workflow_name: str = "base"
    submission_system: str = "DSpace@MIT"
    email_recipients: tuple[str] = ("None",)
    metadata_mapping_path: str = ""
    s3_bucket: str = "dsc"
    output_queue: str = "dsc-unhandled"
  • Revisit use of env vars later!
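Under that decision, a concrete workflow would pin its values by overriding the base class attributes. The `DemoWorkflow` subclass below is a hypothetical illustration; only the base class is copied from the snippet above:

```python
from abc import ABC


class BaseWorkflow(ABC):
    """A base workflow class from which other workflow classes are derived."""

    workflow_name: str = "base"
    submission_system: str = "DSpace@MIT"
    email_recipients: tuple[str] = ("None",)
    metadata_mapping_path: str = ""
    s3_bucket: str = "dsc"
    output_queue: str = "dsc-unhandled"


class DemoWorkflow(BaseWorkflow):
    """Hypothetical workflow overriding only what differs from the base."""

    workflow_name = "demo"
    metadata_mapping_path = "config/demo_metadata_mapping.json"
    output_queue = "dsc-demo-output"
```

Attributes not overridden (such as `s3_bucket`) are inherited from the base class.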

* Update formatting of click options
* Shift reconcile functionality to BaseWorkflow.reconcile_bitstreams_and_metadata method and add corresponding unit tests
* Add unit tests for BaseWorkflow's load and get_workflow methods
* Refactor get_workflow method to use subclasses
* Remove WORKFLOWS constant from config.py
* Add InvalidWorkflowNameError exception
* Replace setdefault call with defaultdict in build_bitstream_dict function
* Remove file type filtering from build_bitstream_dict function and remove related unit and CLI tests
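The setdefault-to-defaultdict change in build_bitstream_dict could look roughly like this; the identifier-from-filename rule and the local file_keys input are illustrative assumptions (the real method would list keys under the S3 batch path):

```python
from collections import defaultdict


def build_bitstream_dict(file_keys: list[str]) -> dict[str, list[str]]:
    """Group bitstream file keys by item identifier.

    Assumes (for illustration only) that the identifier is the filename
    stem before the first underscore.
    """
    bitstream_dict: defaultdict[str, list[str]] = defaultdict(list)
    for file_key in file_keys:
        file_name = file_key.rsplit("/", 1)[-1]
        item_identifier = file_name.split("_")[0]
        # defaultdict removes the need for
        # bitstream_dict.setdefault(item_identifier, []).append(file_key)
        bitstream_dict[item_identifier].append(file_key)
    return dict(bitstream_dict)
```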
@coveralls commented Jan 7, 2025

Pull Request Test Coverage Report for Build 12675953525

Details

  • 75 of 76 (98.68%) changed or added relevant lines in 4 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.4%) to 96.597%

Changes Missing Coverage:

  • dsc/workflows/base/__init__.py — 49 of 50 changed/added lines covered (98.0%)

Files with Coverage Reduction:

  • dsc/workflows/base/__init__.py — 1 new missed line (97.35%)

Totals:

  • Change from base Build 12602522762: +0.4%
  • Covered Lines: 369
  • Relevant Lines: 382

💛 - Coveralls

Comment on lines 114 to 120
# reconcile item identifiers against bitstreams
item_identifier_matches = match_item_identifiers_to_bitstreams(
    bitstream_dict.keys(), item_identifiers
)
file_matches = match_bitstreams_to_item_identifiers(
    bitstream_dict.keys(), item_identifiers
)
@ghukill ghukill Jan 7, 2025

I would propose taking this one step further and defining these as private methods on the class. Looking now, I think the same could be said for utilities.build_bitstream_dict().

In general, I think catch-all files like utilities.py or utils.py or helpers.py should really go under the microscope. I by no means think they should be avoided altogether... but when we have well defined classes like this project uses, I think the bar goes up for when something belongs here.

Getting a bit philosophical, I think moving them into methods kind of forces looking at them in context as well. Each of them is just a list comprehension. Do they actually need to be functions or methods? Would the method reconcile_bitstreams_and_metadata() actually read more cleanly without that misdirection of another function or method to read? In this case, I think it might.

@ehanson8 (Contributor Author)

Good point on making them private methods; build_bitstream_dict can get both params from self. But even though the other two are just list comprehensions, they are involved enough that I personally think the code reads and tests more easily if they remain functions/methods.
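For context, the two matching helpers under discussion are each described as a single list comprehension. A hedged reconstruction, assuming exact-match comparison between bitstream keys and item identifiers (the project's actual matching logic may differ):

```python
def match_item_identifiers_to_bitstreams(bitstream_keys, item_identifiers):
    """Item identifiers from the metadata that have at least one bitstream."""
    return [
        identifier for identifier in item_identifiers if identifier in bitstream_keys
    ]


def match_bitstreams_to_item_identifiers(bitstream_keys, item_identifiers):
    """Bitstream keys that correspond to an item identifier in the metadata."""
    return [key for key in bitstream_keys if key in item_identifiers]
```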

* Shift build_bitstream_dict, match_bitstreams_to_item_identifiers, and match_item_identifiers_to_bitstreams to BaseWorkflow private methods
@jonavellecuerdo jonavellecuerdo (Contributor) left a comment

@ehanson8 Changes are looking good! Just have two comments (potential change requests).

Comment on lines 112 to 122
# reconcile item identifiers against bitstreams
item_identifier_matches = self._match_item_identifiers_to_bitstreams(
    bitstream_dict.keys(), item_identifiers
)
file_matches = self._match_bitstreams_to_item_identifiers(
    bitstream_dict.keys(), item_identifiers
)
logger.info(f"Item identifiers and bitstreams matched: {item_identifier_matches}")
no_bitstreams = set(item_identifiers) - set(item_identifier_matches)
no_item_identifiers = set(bitstream_dict.keys()) - set(file_matches)
return no_bitstreams, no_item_identifiers

Hmm, two questions:

  1. What do you think of the proposed name changes below? A little wordy but I think it helps with clarity (at least for someone like me who struggles with logic that involves matching 😅).
# reconcile item identifiers against bitstreams
item_identifiers_with_bitstream_matches = self._match_item_identifiers_to_bitstreams(
    bitstream_dict.keys(), item_identifiers
)
bitstreams_with_item_identifier_matches = self._match_bitstreams_to_item_identifiers(
    bitstream_dict.keys(), item_identifiers
)

logger.info(f"Item identifiers and bitstreams matched: {item_identifier_matches}")
no_bitstreams = set(item_identifiers) - set(item_identifier_with_bitstream_matches)
no_item_identifiers = set(bitstream_dict.keys()) - set(bitstreams_with_item_identifier_matches)
return no_bitstreams, no_item_identifiers
  2. Can we clarify the log message in line 119? These are technically:

Item identifiers from batch metadata with matching bitstreams:

@ehanson8 (Contributor Author)

Clarity is paramount even when wordy! Great suggestion, thanks!

@jonavellecuerdo (Contributor)

Ooh, I think this needs a small fix (and maybe slightly simpler naming convention):

# reconcile item identifiers against bitstreams
item_identifiers_with_bitstreams = (
    self._match_item_identifiers_to_bitstreams(
        bitstream_dict.keys(), item_identifiers
    )
)

bitstreams_with_item_identifiers = self._match_bitstreams_to_item_identifiers(
    bitstream_dict.keys(), item_identifiers
)

logger.info(
    "Item identifiers from batch metadata with matching bitstreams: "
    f"{item_identifiers_with_bitstreams}"
)

item_identifiers_without_bitstreams = set(item_identifiers) - set(
    item_identifiers_with_bitstreams
)
bitstreams_without_item_identifiers = set(bitstream_dict.keys()) - set(
    bitstreams_with_item_identifiers
)

return item_identifiers_without_bitstreams, bitstreams_without_item_identifiers

Essentially, my initial proposal was to rename the file_matches variable as well!


This will align with the log messages in the CLI.

  • I'd suggest maybe updating the variable names in the reconcile CLI command as well!

Sorry for the confusion!
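A small worked example (with invented identifiers) showing what the renamed set differences compute:

```python
# Hypothetical inputs: identifiers from batch metadata vs. keys from S3
item_identifiers = {"123", "456", "789"}
bitstream_keys = {"123", "456", "000"}

item_identifiers_with_bitstreams = item_identifiers & bitstream_keys
bitstreams_with_item_identifiers = bitstream_keys & item_identifiers

# "789" has metadata but no bitstream; "000" has a bitstream but no metadata
item_identifiers_without_bitstreams = item_identifiers - item_identifiers_with_bitstreams
bitstreams_without_item_identifiers = bitstream_keys - bitstreams_with_item_identifiers
```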

@ehanson8 (Contributor Author)

My bad, fixed!

* Rename reconcile-related variables for clarity
@ghukill ghukill left a comment

Really appreciate all the thoughtful discussions @ehanson8. As discussed, further work will likely fill out pieces that feel a bit fuzzy, but this feels like a good foundation to begin with to build out workflows.

@ehanson8 ehanson8 merged commit 791c615 into main Jan 8, 2025
2 checks passed
@ehanson8 ehanson8 deleted the IN-1103-reconcile-cli-command branch January 8, 2025 19:24
5 participants