In 1433 define record level reconcile item methods #197
Why these changes are being introduced:
* Simple update to the ItemSubmission class that may be handy for certain use cases.

How this addresses that need:
* Add 'get_or_create' method to ItemSubmission class
* Update Workflow.reconcile_items to use new method

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1433
Why these changes are being introduced:
This change is in line with ensuring all key processes are performed with the ItemSubmission class. This is a prerequisite for defining a record-level 'reconcile_item' abstract method that uses the ItemSubmission class.

How this addresses that need:
* Add 'source_metadata' attribute to ItemSubmission
* Set 'source_metadata' in Workflow.reconcile_items

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1433
Why these changes are being introduced:
* The final Workflow.reconcile_items method needs subclasses to define a method for reconciling item submissions at the record-level.

How this addresses that need:
* Add abstract method 'reconcile_item' to base Workflow
* Define SimpleCSV.reconcile_item to check whether an item submission is associated with any bitstreams
* Define OpenCourseWare.reconcile_item to check whether an item submission includes metadata
* Add unit tests for 'reconcile_item' methods

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1433
Approving without hesitation.
I did leave one comment about the status_details string that each <Workflow>.reconcile_item() is expected to return as part of its tuple. While it's not a requested change -- hence the PR approval -- it felt worth mentioning.
The reason I'm not requesting a change now is that, starting with line 284 in Workflow.reconcile_items() (# check for unmatched bitstreams), the code is still very workflow-events driven, and there is some interplay here.
Up until that point the code is delightful, with workflows applying their reconciliation logic and ItemSubmission instances getting updated.
Once you hit line 284, where you're trying to answer, "Were there any failures? If so, what were they? What should I log? And lastly, what is the final reconciliation verdict across all items?", it feels like all those ItemSubmission instances we still have on hand in memory might be sufficient to answer that. Which circles back to the status_details the Workflow reconcile_item() methods are returning. If those were enumerated values, almost like reconciliation failure reasons, and they were also attached to the ItemSubmission (even if temporarily), then you could just loop through all ItemSubmission instances and you'd have enough information to log and report out on.
But I think anything like that would extend the excellent work in this PR. It might be worth taking this bedrock progress, merging it, and then moving on to other workflows. Perhaps we hit another workflow in Wiley, or scanned thesis, or something... where we want to pass a reconciliation message like, "Whoops! This item is 100 terabytes and is therefore too big", and then a bit more control or normalization of status details, and of how to report different types of failures, may bubble up.
TL/DR: approved, awesome work, a couple notes for possible future work
dsc/workflows/base/__init__.py
        )
    else:
        current_status = ItemSubmissionStatus.RECONCILE_FAILED
        if status_details == "missing bitstreams":
I'm noting this as I move through, that this makes me a little nervous but doesn't feel like a dealbreaker.
What makes me nervous is that this str will be determined by the abstract method reconcile_item, where each workflow implements it, but to my reading there is no typing or enforcement of what those strings bubbling back up will be.
A possible improvement could be an Enum of allowed status details? But again, still midway during review. As-is, not blocking, just noting.
Agreed, I would control the values since there are only 2 possibilities: missing bitstreams or missing metadata
Please see the latest commit!
ehanson8
left a comment
Fantastic, I do think controlling the tuple value is important to address but otherwise this is all a great step in the right direction!
* Create StrEnum of ItemSubmissionStatusDetails
* Clarify when table is updated
Pull Request Test Coverage Report for Build 17160435696 (Coveralls)
I don't mean to stir up trouble here, but I'm unsure about the enum being in the DB model.
I had expected it to be in the base Workflow file, called something like ReconcileFailureDetails:
    class ReconcileFailureDetails(StrEnum):
        MISSING_BITSTREAMS = "missing bitstreams"
        MISSING_METADATA = "missing metadata"

Then, you could type the response of the abstract method reconcile_item(). I might even suggest using that as an opportunity to further streamline it:

    @abstractmethod
    def reconcile_item(
        self,
        item_submission: ItemSubmission,
    ) -> None | ReconcileFailureDetails:

Instead of returning a tuple, now you either get None (reconciliation success) or you get a typed, controlled enum value back that says why it failed.
In a way, I think this gets at @ehanson8's comment about "RECONCILE_" being part of enums or exceptions; it's not needed when inside an enum that is about failure.
How I think this differs from the ItemSubmissionStatusDetails enum in the DB file is that you notice all those RECONCILE_FAILURE_ prefixes. This enum would likely grow as we find other reasons to push status details to the DB (e.g. submit failed for reason X, finalize was indeterminate for reason Y).
I would vote to keep the DB model's status_details field non-controlled, even though status is controlled, allowing any part of the application to provide details where and when it can.
What an enum like ReconcileFailureDetails does back in the Workflow base and actual classes is allow for fully typing methods, removing the need for tuple[bool, str] responses, etc. It becomes a representation of a state, more than just a controlled string vocabulary. Ideally, that enum would be the only place we see the string "missing bitstreams", and then we'd never see any of those methods accepting or returning strings, only that enum.
This is a pretty opinionated stance, so please push back, but that was my feeling after a read!
I agree with @ghukill that the enum should live in the Workflow file, not the DB model. All other changes look great though!
FYI @jonavellecuerdo
I'll admit, looking at this again... these are starting to look like custom exceptions? Just a hypothetical, instead of an enum, what if Just a thought. |
@ghukill Thank you for the suggestion! Please see the latest commit where I implemented a solution based on your last two comments on the PR. Question for both of you: Do you think it's okay for |
ghukill
left a comment
Nice work again here.
I felt like I was able to read through the full reconcile_items() method without tripping anywhere. Feels like it has really evolved.
I was ready for another enthusiastic Approve, but I'm getting failing tests:
* test_sync_success()
* test_sync_use_source_and_destination_success()
I have a hunch they are related to moto mocking and... maybe my usage of AWS SSO?
Here is some output:
INFO dsc.cli:cli.py:208 Syncing data from s3://source/test/batch-aaa/ to s3://destination/test/batch-aaa/
ERROR dsc.cli:cli.py:240 fatal error: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.
ERROR dsc.cli:cli.py:247 Failed to sync (exit code: 1)
and the other test:
INFO dsc.cli:cli.py:208 Syncing data from s3://source/test/batch-aaa to s3://destination/test/batch-aaa
ERROR dsc.cli:cli.py:240 fatal error: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.
ERROR dsc.cli:cli.py:247 Failed to sync (exit code: 1)
Otherwise, it's looking good! Ready for approval when this is resolved.
    try:
        self.reconcile_item(item_submission)
    except ReconcileFailedError as exception:
        reconcile_status = ItemSubmissionStatus.RECONCILE_FAILED
        status_details = str(exception)
This is feeling pretty good. And FWIW, this did not seem obvious at the onset of this work; like an iterative finding.
Agreed!
    if isinstance(exception, ReconcileFailedMissingBitstreamsError):
        metadata_without_bitstreams.append(item_submission.item_identifier)
Nice application here: this feels like when controlled exceptions (though would have worked with enums) start to pay dividends.
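For readers following along, a self-contained sketch of the exception-based approach visible in the two snippets above. ReconcileFailedError and ReconcileFailedMissingBitstreamsError are named in the diff; the ReconcileFailedMissingMetadataError class, the exception messages, the dict-based item records, and the status strings are assumptions for illustration.

```python
class ReconcileFailedError(Exception):
    """Base class for record-level reconciliation failures."""


class ReconcileFailedMissingBitstreamsError(ReconcileFailedError):
    def __init__(self) -> None:
        super().__init__("missing bitstreams")  # assumed message


class ReconcileFailedMissingMetadataError(ReconcileFailedError):  # hypothetical
    def __init__(self) -> None:
        super().__init__("missing metadata")  # assumed message


def reconcile_item(item: dict) -> None:
    """Return None on success; raise a ReconcileFailedError subclass otherwise."""
    if not item.get("bitstreams"):
        raise ReconcileFailedMissingBitstreamsError
    if not item.get("metadata"):
        raise ReconcileFailedMissingMetadataError


def reconcile_items(items: list[dict]) -> tuple[dict, list[str]]:
    """Record a (status, status_details) pair per item and collect bitstream gaps."""
    results: dict[str, tuple[str, str | None]] = {}
    metadata_without_bitstreams: list[str] = []
    for item in items:
        identifier = item["item_identifier"]
        try:
            reconcile_item(item)
        except ReconcileFailedError as exception:
            # str(exception) doubles as the free-form status_details value.
            results[identifier] = ("reconcile_failed", str(exception))
            if isinstance(exception, ReconcileFailedMissingBitstreamsError):
                metadata_without_bitstreams.append(identifier)
        else:
            results[identifier] = ("reconcile_success", None)
    return results, metadata_without_bitstreams
```

Catching the base class keeps the loop generic, while the isinstance check lets the caller branch on a specific failure without comparing strings.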
ehanson8
left a comment
Excellent work! And I'm not getting the same test failures as @ghukill when running them locally
ghukill
left a comment
Approved!
Whatever weird AWS credentials + testing stuff is happening appears local to my machine. If we're passing in CI, then great.
I'll continue to investigate, and may open a PR if it's a worthwhile update, but seems isolated to me.
c028025 to 3461e00
Thank you for your thoughtful reviews! I am happy with how this turned out and am excited to see how it holds up to the upcoming workflow implementations. 🤓 🤞🏼
For completeness sake, I had an outdated |
Purpose and background context
Where do I begin . . . This PR marks the epic conclusion to aligning the structure of the reconcile_items method with the other workflow methods (submit_items and finalize_items). Keeping this PR description on the shorter side as I recommend reviewers to review the commits and the provided commit messages in chronological order.
How can a reviewer manually see the effects of these changes?
Review updated unit tests. Made an effort to align the SimpleCSV and OpenCourseWare unit tests and did a good amount of cleanup!
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?