In 1103 reconcile cli command #56
* Refactor BaseWorkflow to move init args to class attributes
* Add load and get_workflow class methods to BaseWorkflow class
* Add metadata_mapping property to BaseWorkflow class
* Add click options to main CLI command
* Add load method call and other basic functionality to main CLI command
* Add post_main_group_subcommand to cli.py
* Add TestBaseWorkflow class and TestSimpleCSV class to conftest.py
Why these changes are being introduced:
* A CLI command is needed to reconcile metadata against files in the specified S3 batch path.

How this addresses that need:
* Add reconcile CLI command and corresponding CLI tests
* Add build_bitstream_dict, match_bitstreams_to_item_identifiers, and match_item_identifiers_to_bitstreams functions as well as corresponding unit tests
* Add test_metadata_mapping.json to fixtures directory

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1128
ghukill left a comment
I'm understanding how this application is taking shape, and continue to think the overall pattern is a great one (i.e. workflows).
Leaving a "Comment" review at this time with somewhat high level questions.
@ehanson8 I think this is a great start! As I reviewed the updates in this PR, it made me wonder about environment variables, and whether we could have a team discussion about the workflow implementation in more detail (see #56 (comment))?
```python
workflow_name: str = "base"
submission_system: str = "DSpace@MIT"
email_recipients: tuple[str] = ("None",)
metadata_mapping_path: str = ""
s3_bucket: str = ""
output_queue: str = ""
```
Our recent discussions around how the workflows will be initialized made me wonder how these values are set for workflow instances.
Recalling what was written in the DSC engineering plan:
The application will be deployed as an ECS task that is executed via CLI commands using the boto3 library.
Expanding the instructions in the Example Application Run, these are the steps I figured I'd need to do:
Prerequisite

- Open a terminal, navigate to my local clone of DSC, and activate my virtual environment.

Running the `reconcile` command

- Set up a `.env` file at the root of the DSC app directory. Looking at the current definition of the `reconcile` command, it seems like it needs to know the following values for a given workflow: `[s3_bucket, batch_path]`. The `.env` file should look like:

  ```
  DSC_S3_BUCKET="dsc"
  ```

- Run the `reconcile` command in the terminal:

  ```
  dsc -w <workflow-name> -b <batch_id> -c <collection-handle> reconcile
  ```
Running the `deposit` command

Option A: Running deposit via the DSC CLI

- Set up a `.env` file at the root of the DSC app directory:

  ```
  DSC_S3_BUCKET="dsc"
  DSC_SUBMISSION_SYSTEM="DSpace@MIT"
  DSC_OUTPUT_QUEUE="output-queue"
  ```

- Run the `deposit` command in the terminal:

  ```
  dsc -w <workflow-name> -b <batch-id> -c <collection-handle> deposit
  ```
Option B: Running deposit ECS task using AWS CLI

- Set up a JSON file with overrides for the ECS task (e.g., `overrides.json`):

  ```json
  {
    "containerOverrides": [
      {
        "name": "dsc-ecs-<env>",
        "command": [
          "-w", "<workflow-name>",
          "-b", "<batch-id>",
          "-c", "<collection-handle>",
          "deposit"
        ],
        "environment": [
          {"name": "DSC_S3_BUCKET", "value": "dsc"},
          {"name": "DSC_SUBMISSION_SYSTEM", "value": "DSpace@MIT"},
          {"name": "DSC_OUTPUT_QUEUE", "value": "<output-queue-name>"}
        ]
      }
    ]
  }
  ```

- Run the `aws ecs run-task` CLI command in the terminal (see the Carbon `Makefile` for an example):

  ```
  aws ecs run-task --cluster dsc-ecs-<env> --task-definition dsc-ecs-<env>-workflow-deposit \
    --launch-type="FARGATE" --region us-east-1 \
    --network-configuration '{"awsvpcConfiguration": {"subnets": [<subnets>], "securityGroups": [<security-groups>], "assignPublicIp": "DISABLED"}}' \
    --overrides file://overrides.json
  ```
Option C: Running deposit ECS task using ??? DSC CLI command using the boto3 library.
Note: This is how I interpreted what is currently written in the Engineering Plan. 🤔 I believe the idea was from our recent work with the Alma SAP Invoices UI where we defined an ECSClient to run ECS tasks.
This work needs to be scoped out some more, but in terms of passing ... workflow parameter values, it should support accessing workflow parameter values set in a `.env` via the `Config`.
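To make Option C concrete, here is a minimal sketch of what a boto3-backed `ECSClient` might look like. The `ECSClient` name comes from the Alma SAP Invoices work mentioned above, but the body, the `build_overrides` helper, and all cluster/container names are illustrative assumptions, not the project's actual API; `run_task` is boto3's real ECS operation.

```python
def build_overrides(
    workflow: str, batch_id: str, collection_handle: str, env: dict[str, str]
) -> dict:
    """Build the containerOverrides payload that ecs.run_task expects.

    The container name "dsc-ecs-dev" is illustrative.
    """
    return {
        "containerOverrides": [
            {
                "name": "dsc-ecs-dev",
                "command": [
                    "-w", workflow, "-b", batch_id, "-c", collection_handle, "deposit"
                ],
                "environment": [{"name": k, "value": v} for k, v in env.items()],
            }
        ]
    }


class ECSClient:
    """Hypothetical thin wrapper around boto3's ECS client (needs AWS credentials)."""

    def __init__(self, cluster: str, task_definition: str):
        import boto3  # deferred import so the sketch loads without boto3 installed

        self.client = boto3.client("ecs")
        self.cluster = cluster
        self.task_definition = task_definition

    def run(self, overrides: dict) -> dict:
        # run_task is a real boto3 ECS API; network configuration omitted for brevity
        return self.client.run_task(
            cluster=self.cluster,
            taskDefinition=self.task_definition,
            launchType="FARGATE",
            overrides=overrides,
        )
```

The payload builder mirrors the `overrides.json` shown in Option B, so the same values could be passed either through the AWS CLI or through a DSC CLI command wrapping this client.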
Takeaways

- Environment variable names are subject to change (curious what folks would find helpful) 🤓
- Change request: Include additional variables related to the `reconcile`, `deposit`, and `finalize` DSC CLI commands in `REQUIRED_ENV_VARS` and `OPTIONAL_ENV_VARS` in `dsc.Config`.
- Request: Have a discussion as a team to scope out the workflow implementation
  - Create a ticket to scope out a CLI command that runs an ECS task for the DSC `deposit` CLI command.
  - Create a ticket for defining an `ECSClient`.
Also, I think this started with having some questions about how the workflow attributes retrieve values.

- I think it makes sense for `workflow_name` and `metadata_mapping` to be hardcoded in the class definition.
- To support flexibility, I think `submission_system`, `email_recipients`, and `s3_bucket` should be accessed as environment variables via the `Config`. 🤔
As noted by @ehanson8 in Slack, the questions/comments above do not need to be addressed via this PR, but it has prompted a discussion about workflow implementation to be addressed in future PRs!
We synced up on env vars and arrived at the following decision:
- Explicitly set class attributes in the workflow module:
```python
class BaseWorkflow(ABC):
    """A base workflow class from which other workflow classes are derived."""

    workflow_name: str = "base"
    submission_system: str = "DSpace@MIT"
    email_recipients: tuple[str] = ("None",)
    metadata_mapping_path: str = ""
    s3_bucket: str = "dsc"
    output_queue: str = "dsc-unhandled"
```

- Revisit use of env vars later!
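Under this decision, a concrete workflow would simply override the class attributes it needs to change. A minimal runnable sketch follows; the `SimpleCSV` subclass, its values, and the body of `get_workflow` are illustrative, loosely based on the commit note about refactoring `get_workflow` to use subclasses:

```python
from abc import ABC


class BaseWorkflow(ABC):
    """A base workflow class from which other workflow classes are derived."""

    workflow_name: str = "base"
    submission_system: str = "DSpace@MIT"
    s3_bucket: str = "dsc"
    output_queue: str = "dsc-unhandled"

    @classmethod
    def get_workflow(cls, workflow_name: str) -> type["BaseWorkflow"]:
        """Find the subclass whose workflow_name matches.

        Sketch of the subclass-based lookup mentioned in the commit messages.
        """
        for subclass in cls.__subclasses__():
            if subclass.workflow_name == workflow_name:
                return subclass
        raise ValueError(f"Invalid workflow name: {workflow_name}")


class SimpleCSV(BaseWorkflow):
    """Hypothetical concrete workflow; only overrides what differs from the base."""

    workflow_name = "simple-csv"
    output_queue = "simple-csv-output"
```

Values not overridden (e.g. `submission_system`) fall through to the base class, which is what makes the explicit-class-attribute approach attractive.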
* Update formatting of click options
* Shift reconcile functionality to BaseWorkflow.reconcile_bitstreams_and_metadata method and add corresponding unit tests
* Add unit tests for BaseWorkflow's load and get_workflow methods
* Refactor get_workflow method to use subclasses
* Remove WORKFLOWS constant from config.py
* Add InvalidWorkflowNameError exception
* Replace setdefault call with defaultdict in build_bitstream_dict function
* Remove file type filtering from build_bitstream_dict function and remove related unit and CLI tests
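For context on the `defaultdict` commit above, a `build_bitstream_dict` grouping S3 keys by item identifier might look like the sketch below. The identifier-parsing rule (filename stem before the first underscore or extension) is an assumption for illustration; the real function may parse keys differently.

```python
from collections import defaultdict


def build_bitstream_dict(file_list: list[str]) -> dict[str, list[str]]:
    """Group bitstream file paths by item identifier.

    Assumes the identifier is the filename stem before the first underscore
    (or extension); defaultdict avoids the repeated setdefault calls.
    """
    bitstream_dict: dict[str, list[str]] = defaultdict(list)
    for file_path in file_list:
        file_name = file_path.rsplit("/", 1)[-1]
        item_identifier = file_name.split("_")[0].split(".")[0]
        bitstream_dict[item_identifier].append(file_path)
    return dict(bitstream_dict)
```

With `defaultdict(list)`, appending to a not-yet-seen identifier transparently creates the empty list first, which is the tidier equivalent of `bitstream_dict.setdefault(item_identifier, []).append(file_path)`.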
Pull Request Test Coverage Report for Build 12675953525

💛 - Coveralls
```python
# reconcile item identifiers against bitstreams
item_identifier_matches = match_item_identifiers_to_bitstreams(
    bitstream_dict.keys(), item_identifiers
)
file_matches = match_bitstreams_to_item_identifiers(
    bitstream_dict.keys(), item_identifiers
)
```
I would propose taking this one step further and defining these as private methods on the class. Looking now, I think the same could be said for utilities.build_bitstream_dict().
In general, I think catch-all files like utilities.py or utils.py or helpers.py should really go under the microscope. I by no means think they should be avoided altogether... but when we have well defined classes like this project uses, I think the bar goes up for when something belongs here.
Getting a bit philosophical, I think moving them into methods kind of forces looking at them in context as well. Each of them is just a list comprehension. Do they actually need to be functions or methods? Would the method reconcile_bitstreams_and_metadata() actually read more cleanly without that misdirection of another function or method to read? In this case, I think it might.
Good point on making them private methods; build_bitstream_dict can get both params from self. But even though the other two are just list comprehensions, they are involved enough that I personally think they read and test more easily as functions/methods.
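For reference, the two matchers under discussion each amount to a single list comprehension. A sketch under assumed signatures; whether "matching" means substring containment (used here) or exact equality between identifiers and bitstream keys is an assumption, not confirmed by this PR:

```python
def match_item_identifiers_to_bitstreams(bitstream_keys, item_identifiers):
    """Item identifiers that have at least one bitstream key containing them."""
    return [
        identifier
        for identifier in item_identifiers
        if any(identifier in key for key in bitstream_keys)
    ]


def match_bitstreams_to_item_identifiers(bitstream_keys, item_identifiers):
    """Bitstream keys that contain at least one known item identifier."""
    return [
        key
        for key in bitstream_keys
        if any(identifier in key for identifier in item_identifiers)
    ]
```

Seen side by side like this, the two comprehensions are symmetric, which is part of why the naming discussion later in this thread matters so much for readability.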
* Shift build_bitstream_dict, match_bitstreams_to_item_identifiers, and match_item_identifiers_to_bitstreams to BaseWorkflow private methods
jonavellecuerdo left a comment
@ehanson8 Changes are looking good! Just have two comments (potential change requests).
dsc/workflows/base/__init__.py (Outdated)
```python
# reconcile item identifiers against bitstreams
item_identifier_matches = self._match_item_identifiers_to_bitstreams(
    bitstream_dict.keys(), item_identifiers
)
file_matches = self._match_bitstreams_to_item_identifiers(
    bitstream_dict.keys(), item_identifiers
)
logger.info(f"Item identifiers and bitstreams matched: {item_identifier_matches}")
no_bitstreams = set(item_identifiers) - set(item_identifier_matches)
no_item_identifiers = set(bitstream_dict.keys()) - set(file_matches)
return no_bitstreams, no_item_identifiers
```
Hmm, two questions:
- What do you think of the proposed name changes below? A little wordy but I think it helps with clarity (at least for someone like me who struggles with logic that involves matching 😅).
```python
# reconcile item identifiers against bitstreams
item_identifiers_with_bitstream_matches = self._match_item_identifiers_to_bitstreams(
    bitstream_dict.keys(), item_identifiers
)
bitstreams_with_item_identifier_matches = self._match_bitstreams_to_item_identifiers(
    bitstream_dict.keys(), item_identifiers
)
logger.info(f"Item identifiers and bitstreams matched: {item_identifier_matches}")
no_bitstreams = set(item_identifiers) - set(item_identifier_with_bitstream_matches)
no_item_identifiers = set(bitstream_dict.keys()) - set(
    bitstreams_with_item_identifier_matches
)
return no_bitstreams, no_item_identifiers
```

- Can we clarify the log message in line 119? These are technically:

  `Item identifiers from batch metadata with matching bitstreams:`
Clarity is paramount even when wordy! Great suggestion, thanks!
Ooh, I think this needs a small fix (and maybe slightly simpler naming convention):
```python
# reconcile item identifiers against bitstreams
item_identifiers_with_bitstreams = (
    self._match_item_identifiers_to_bitstreams(
        bitstream_dict.keys(), item_identifiers
    )
)
bitstreams_with_item_identifiers = self._match_bitstreams_to_item_identifiers(
    bitstream_dict.keys(), item_identifiers
)
logger.info(
    "Item identifiers from batch metadata with matching bitstreams: "
    f"{item_identifiers_with_bitstreams}"
)
item_identifiers_without_bitstreams = set(item_identifiers) - set(
    item_identifiers_with_bitstreams
)
bitstreams_without_item_identifiers = set(bitstream_dict.keys()) - set(
    bitstreams_with_item_identifiers
)
return item_identifiers_without_bitstreams, bitstreams_without_item_identifiers
```

Essentially, my initial proposal was to rename the `file_matches` variable as well! This will align with the log messages in the CLI.
- I'd suggest maybe updating the variable names in the `reconcile` CLI command as well!
Sorry for the confusion!
My bad, fixed!
* Rename reconcile-related variables for clarity
ghukill left a comment
Really appreciate all the thoughtful discussions @ehanson8. As discussed, further work will likely fill out pieces that feel a bit fuzzy, but this feels like a good foundation to begin with to build out workflows.
Purpose and background context
Add workflow functionality in addition to a `reconcile` CLI command. Reviewing each commit separately would be ideal.

The `reconcile` command would be run manually before each DSC submission batch to ensure that the metadata and bitstreams are properly aligned, confirming there are no bitstreams without metadata and no metadata without at least one bitstream. After that is confirmed, the soon-to-be-written `deposit` CLI command would be run for the batch.

How can a reviewer manually see the effects of these changes?
Not possible yet
Includes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)