Skip to content

Conversation

@jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented Aug 21, 2025

Purpose and background context

Where do I begin . . . This PR marks the epic conclusion to aligning the structure of the reconcile_items method with the other workflow methods (submit_items and finalize_items). Keeping this PR description on the shorter side as I recommend reviewers to review the commits and the provided commit messages in chronological order.

How can a reviewer manually see the effects of these changes?

Review updated unit tests. Made an effort to align the SimpleCSV and OpenCourseWare unit tests and did a good amount of cleanup!

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Why these changes are being introduced:
* Simple update to the ItemSubmission class that may be handy for
certain use cases.

How this addresses that need:
* Add 'get_or_create' method to ItemSubmission class
* Update Workflow.reconcile_items to use new method

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1433
Why these changes are being introduced:
This change is in line ensuring all key processes are performed with the
ItemSubmission class. This is a pre-requisite for defining a record-level
'reconcile_item' abstract method that uses the ItemSubmission class.

How this addresses that need:
* Add 'source_metadata' attribute to ItemSubmission
* Set 'source_metadata' in Workflow.reconcile_items

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1433
Why these changes are being introduced:
* The final Workflow.reconcile_items method needs subclasses to define
a method for reconciling item submissions at the record-level.

How this addresses that need:
* Add abstract method 'reconcile_item' to base Workflow
* Define SimpleCSV.reconcile_item to check whether an item submission
is associated with any bitstreams
* Define OpenCourseWare.reconcile_item to check whether an item submission
includes metadata
* Add unit tests for 'reconcile_item' methods

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1433
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review August 21, 2025 15:18
@jonavellecuerdo jonavellecuerdo requested a review from a team as a code owner August 21, 2025 15:18
@ghukill ghukill self-assigned this Aug 21, 2025
Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving without hesitation.

I did leave one comment about the status_details string that each <Workflow>.reconcile_item() is expected to return as part of its tuple. While it's not a requested change -- hence the PR approval -- it felt worth mentioning.

Why I'm not requesting a change now, is that I think starting with line 284 in Wofklow.reconcile_items(), # check for unmatched bitstreams, it is still very workflow events driven, and there is some interplay here.

Up until that point the code is delightful, with workflows applying their reconciliation logic and ItemSubmission instances getting updated.

Once you hit line 284 where you're trying to then answer, "Were there any failures? if so, what were they? what should I log? and lastly, what is the final reconcliation verdict across all items?" it feels like all those ItemSubmission instances we have still on hand, in memory might be sufficient to answer that. Which circles back to the status_details the Workflow reoncile_item() methods are returning. If those were enumerated values, almost like reconciliation failure reasons, and they were also attached to the ItemSubmission (even if temporarily) then you could just loop through all ItemSubmission instances and you'd have enough information to log and report out on.

But, I think anything like that would extend the excellent work in this PR. It might be worth taking this bedrock progress, merging it, and then moving on to other workflows. Perhaps we hit another workflow in Wiley, or scanned thesis, or something... where we want to pass a reconciliation message like, "Whoops! This item is 100 terabytes and is therefore too big" and then we might find a bit more control or normalization of status details, and how to report different types of failures, may bubble up.

TL/DR: approved, awesome work, a couple notes for possible future work

)
else:
current_status = ItemSubmissionStatus.RECONCILE_FAILED
if status_details == "missing bitstreams":
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm noting this as I move through, that this makes me a little nervous but doesn't feel like a dealbreaker.

What makes me nervous is that this str will be determined by the abstract method reconcile_item, where each workflow implements it, but to my reading there is no typing or enforcement of what those strings bubbling back up will be.

A possible improvement could be an Enum of allowed status details? But again, still midway during review. As-is, not blocking, just noting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I would control the the values since there's only 2 possibilities: missing bitstreams or missing metadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the latest commit!

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic, I do think it controlling the tuple value is important to address but otherwise this is all a great step in the right direction!

)
else:
current_status = ItemSubmissionStatus.RECONCILE_FAILED
if status_details == "missing bitstreams":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I would control the the values since there's only 2 possibilities: missing bitstreams or missing metadata

jonavellecuerdo added a commit that referenced this pull request Aug 21, 2025
* Create StrEnum of ItemSubmissionStatusDetails
* Clarify when table is updated
@coveralls
Copy link

coveralls commented Aug 21, 2025

Pull Request Test Coverage Report for Build 17160435696

Details

  • 67 of 69 (97.1%) changed or added relevant lines in 5 files are covered.
  • 5 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.5%) to 96.682%

Changes Missing Coverage Covered Lines Changed/Added Lines %
dsc/workflows/base/init.py 40 42 95.24%
Files with Coverage Reduction New Missed Lines %
dsc/exceptions.py 5 81.58%
Totals Coverage Status
Change from base Build 17105956610: -0.5%
Covered Lines: 1020
Relevant Lines: 1055

💛 - Coveralls

Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean to stir up trouble here, but I'm unsure about the enum being in the DB model.

I had expected it to be in the base Workflow file, called something like ReconcileFailureDetails:

class ReconcileFailureDetails(StrEnum):
    MISSING_BITSTREAMS = "missing bitstreams"
    MISSING_METADATA = "missing metadata"

Then, you could type the response of the abstract method reconcile_item(). I might even suggest using that as an opportunity to further streamline it:

@abstractmethod
def reconcile_item(
    self,
    item_submission: ItemSubmission,
) -> None | ReconcileFailureDetails:

Instead of returning a tuple, now you either get None (reconciliation success) or you get a typed, controlled reason enum value back why it failed.

In a way, I think this gets at @ehanson8's comment about "RECONCILE_" being part of enums or exceptions; it's not needed when inside an enum that is about failure.

How I think this differs from the ItemSubmissionStatusDetails enum in the DB file, is you notice all those RECONCILE_FAILURE_ prefixes. This enum would likely grow as we find other reasons to push status details to the DB (e.g. submit failed for reason X, finalize was interdeterminate for reason Y).

I would vote to keep the DB status_details field model non-controlled, even though status_details is, allowing any part of the application to provide details where and when it can.

What an enum like ReconcileFailureDetails does back in the Workflow base and actual classes is allow for fully typing methods, removing the need for tuple[bool, str] responses, etc. It becomes representation of a state, more than just a controlled string vocabulary. Ideally, that enum would be the only place we see the string "missing bitstreams" and the we'd never see any of those methods accepting or returning strings, only that enum.

This is a pretty opinionated stance, so please pushback, but that was my feeling after a read!

@ehanson8
Copy link
Contributor

I agree with @ghukill that the enum should live the Workflow file not the DB model. All other changes look great though!

@ghukill
Copy link

ghukill commented Aug 21, 2025

FYI @jonavellecuerdo

Then, you could type the response of the abstract method reconcile_item(). I might even suggest using that as an opportunity to further streamline it:

@abstractmethod
def reconcile_item(
    self,
    item_submission: ItemSubmission,
) -> None | ReconcileFailureDetails:

I'll admit, looking at this again... these are starting to look like custom exceptions?

Just a hypothetical, instead of an enum, what if reconcile_item() either returned None or raised a custom exception? These exceptions, when stringified, could become the status_details that eventually end up in the DB.

Just a thought.

@jonavellecuerdo
Copy link
Contributor Author

@ghukill Thank you for the suggestion! Please see the latest commit where I implemented a solution based on your last two comments on the PR.

Question for both of you: Do you think it's okay for Workflow.reconcile_items to still log a warning regarding the overall status of the reconcile for the batch? 🤔

Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work again here.

I felt like I was able to read through the full reconcile_items() method without tripping anywhere. Feels like it has really evolved.

I was ready for another enthusiastic Approve, but I'm getting failing tests:

  • test_sync_success()
  • test_sync_use_source_and_destination_success()

I have a hunch they are related to moto mocking and....maybe my usage of AWS SSO?

Here is some output:

INFO     dsc.cli:cli.py:208 Syncing data from s3://source/test/batch-aaa/ to s3://destination/test/batch-aaa/
ERROR    dsc.cli:cli.py:240 fatal error: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.

ERROR    dsc.cli:cli.py:247 Failed to sync (exit code: 1)

and the other test:

INFO     dsc.cli:cli.py:208 Syncing data from s3://source/test/batch-aaa to s3://destination/test/batch-aaa
ERROR    dsc.cli:cli.py:240 fatal error: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.

ERROR    dsc.cli:cli.py:247 Failed to sync (exit code: 1)

Otherwise, it's looking good! Ready for approval when this is resolved.

Comment on lines +258 to +262
try:
self.reconcile_item(item_submission)
except ReconcileFailedError as exception:
reconcile_status = ItemSubmissionStatus.RECONCILE_FAILED
status_details = str(exception)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is feeling pretty good. And FWIW, this did not seem obvious at the onset of this work; like an iterative finding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

Comment on lines +264 to +265
if isinstance(exception, ReconcileFailedMissingBitstreamsError):
metadata_without_bitstreams.append(item_submission.item_identifier)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice application here: this feels like when controlled exceptions (though would have worked with enums) start to pay dividends.

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! And I'm not getting the same test failures as @ghukill when running the them locally

@ghukill ghukill self-requested a review August 22, 2025 15:58
Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved!

Whatever weird AWS credentials + testing stuff is happening appears local to my machine. If we're passing in CI, then great.

I'll continue to investigate, and may open a PR if it's a worthwhile update, but seems isolated to me.

@jonavellecuerdo jonavellecuerdo force-pushed the IN-1433-define-record-level-reconcile-item-methods branch from c028025 to 3461e00 Compare August 22, 2025 16:18
@jonavellecuerdo
Copy link
Contributor Author

Thank you for your thoughtful reviews! I am glad with how this turned out and am excited to see how it holds up to the upcoming workflow implementations. 🤓 🤞🏼

@jonavellecuerdo jonavellecuerdo merged commit cbf6c59 into main Aug 22, 2025
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the IN-1433-define-record-level-reconcile-item-methods branch August 22, 2025 16:36
@ghukill
Copy link

ghukill commented Aug 22, 2025

Whatever weird AWS credentials + testing stuff is happening appears local to my machine. If we're passing in CI, then great.

I'll continue to investigate, and may open a PR if it's a worthwhile update, but seems isolated to me.

For completeness sake, I had an outdated aws cli. Looks as though somewhere along the way aws s3 sync started to pickup AWS_ENDPOINT_URL without an explicit --endpoint= argument.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants