-
Notifications
You must be signed in to change notification settings - Fork 0
Add batch creation method to Workflow abstract base class #200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Why these changes are being introduced: * Adding an explicit 'batch creation' step to the DSC workflow means a new method that can handle all init-level operations, such as persisting records to DynamoDB for the first time. This allows for simplifying some of the methods in the ItemSubmission and ItemSubmissionDB models. How this addresses that need: * Refactor ItemSubmissionDB.create as a conditional save method; instances are created by calling the ItemSubmissionDB constructor * Add save method to domain model, which uses ItemSubmissionDB.create to perform conditional save * Deprecate unused ItemSubmissionDB.get_or_create method Side effects of this change: * This encourages the following approach: if writing new records, use ItemSubmission.create + ItemSubmission.save if updating existing records, set attributes on ItemSubmission() + ItemSubmission.upsert_db TLDR: Any upserts will update and overwrite an existing record in DynamoDB. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1455
adf4ae1 to
f2d9e07
Compare
Pull Request Test Coverage Report for Build 18017142515Details
💛 - Coveralls |
| item_submission = ItemSubmission.get( | ||
| batch_id=self.batch_id, | ||
| item_identifier=item_metadata["item_identifier"], | ||
| workflow_name=self.workflow_name, | ||
| source_system_identifier=item_metadata.get("source_system_identifier"), | ||
| ) | ||
|
|
||
| # if no corresponding record in DynamoDB, skip | ||
| if not item_submission: | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, the Workflow.reconcile_items method is simply updated to use ItemSubmission.get, removing the option to create the record during the reconcile step. This is because a record is now added during the create-batch step, which assumes the responsibility of creating an item in the DynamoDB table for every item in a batch.
Note: If ItemSubmission.get returns None--i.e., a corresponding record is not found in DynamoDB--the item submission is skipped from the reconcile. This raises the question of how to report these cases, as reconcile_items currently only tracks metadata without bitstreams and vice versa. I propose resolving this as outside of the scope of this ticket and will likely be revisited as we work on the Wiley workflow and can take a closer look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this @jonavellecuerdo. I'm appreciating the tension here given that reconcile was still using metadata to drive its iteration. And I think this may prompt another comment for me on that specific line.
Regardless of the outcome on the discussion, I support moving forward and maybe creating a ticket if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we switch to iteration from DynamoDB rather than the metadata, this check is unnecessary
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good to me, nice work on scaffolding this update.
Requesting comment for some inline comments, and this question:
Do we need a new CLI command for create-batch?
| item_submission = ItemSubmission.get( | ||
| batch_id=self.batch_id, | ||
| item_identifier=item_metadata["item_identifier"], | ||
| workflow_name=self.workflow_name, | ||
| source_system_identifier=item_metadata.get("source_system_identifier"), | ||
| ) | ||
|
|
||
| # if no corresponding record in DynamoDB, skip | ||
| if not item_submission: | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this @jonavellecuerdo. I'm appreciating the tension here given that reconcile was still using metadata to drive its iteration. And I think this may prompt another comment for me on that specific line.
Regardless of the outcome on the discussion, I support moving forward and maybe creating a ticket if needed.
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! A few comments, mostly piggybacking off of @ghukill
| item_submission = ItemSubmission.get( | ||
| batch_id=self.batch_id, | ||
| item_identifier=item_metadata["item_identifier"], | ||
| workflow_name=self.workflow_name, | ||
| source_system_identifier=item_metadata.get("source_system_identifier"), | ||
| ) | ||
|
|
||
| # if no corresponding record in DynamoDB, skip | ||
| if not item_submission: | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we switch to iteration from DynamoDB rather than the metadata, this check is unnecessary
9adb187 to
7f5a824
Compare
|
The latest commits were motivated by the discussion in #200 (comment). In order to support moving the
NOTE: While I put in effort to update existing tests, I've opted not to add to the existing test suite. I am hoping to do some test cleanup as part of the later tickets as adding new tests to the existing suite has proven to be quite difficult. I do think the existing tests do offer some value but could be refactored to be clearer and to better support new tests for added functions. |
|
Thanks for the summary @jonavellecuerdo. Before I jump into a review, I have a concern: If I'm recalling correctly, DynamoDB has fairly small limits on row size, I think like 400kb or something. While I'm betting most metadata would fit, it feels risky to store it there. Admiteddly, I have not fully embarked on a review to understand how and why this change was needed, but I do see this in a commit message:
If the goal is just to avoid S3 reads, I'm not sure Dynamo is a good fit. If part of the goal is to normalize how different workflows loop through metadata, that feels a bit different. Happy to chat it out if helpful, sync or async, but I do worry that Dynamo isn't up for the task of storing item metadata. |
I was surprised by this as well. I know I was saying to "minimize calls to |
This was my understanding as well. If we assume it feels correct that the result of
Then it feels like reconcile, submit, and finalize can all use Dynamo to drive their loop. What this complicates is when you have a "loose" file in the batch of which there is no metadata. How can we know, if we are driving reconciliation by intellectual items in Dynamo, that came from metadata in the first place? My answer would be ignoring loose files. Perhaps when things settle down we could go back with a microscope and try to figure out how to bubble up that reconciliation error, but it continues to feel odd. I think we should let metadata records drive the whole thing, obviously noting when we can't find any bitstreams for them, but choose to ignore when there are "loose" files in the batch. Or heck, perhaps reconciliation could identify them by process of elimination? If we unite metadata records with bitstreams (passing reconciliation), but we have a file left over at the end, could we:
|
I agree, this is a minimal risk in the first place but we could have a ticket to address the issue later |
7f5a824 to
eb1c57c
Compare
|
@ghukill @ehanson8 I am requesting for another pass at review! As mentioned, I dropped the two commits that involved changes to the DynamoDB model. I've refactored the batch creation methods of Looking forward to getting your thoughts. :) |
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hearty approval! This is looking great. I know there is still wiring to be done for the CLI, and I'm betting implementing in all the workflows may put some pressure on it and potentially require little tweaks, but it feels like a good foundation for this batch creation.
And, potentially the removal of reconcile, broadly speaking.
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, a few comments!
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good changes, let's just remove the _insufficient_item_metadata method and then I'll approve!
ehanson8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work!
Why these changes are being introduced: * Support workflows that require programmatic batch creation. It was decided that the batch creation step would assume the responsibilities that were previously assigned to the reconcile step, which is now up for deprecation. How this addresses that need: * Define shared private method for writing records to DynamoDB * Add abstract method custom batch preparation processes * Define SimpleCSV.prepare_batch * Add unit tests for batch creation methods * Remove record creation from Workflow.reconcile_items Side effects of this change: * Sets the stage for deprecation of 'reconcile_items' Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1455
8a05e5c to
eafb01e
Compare
Purpose and background context
Support DSC workflows that require programmatic batch creation.
How can a reviewer manually see the effects of these changes?
For now, passing unit tests will suffice.
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES - Enables the addition of a "batch creation" step to the DSC workflow.
What are the relevant tickets?