Skip to content

Conversation

@jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented Sep 16, 2025

Purpose and background context

Support DSC workflows that require programmatic batch creation.

How can a reviewer manually see the effects of these changes?

For now, passing unit tests will suffice.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES - Enables the addition of a "batch creation" step to the DSC workflow.

What are the relevant tickets?

Why these changes are being introduced:
* Adding an explicit 'batch creation' step to the DSC workflow
means a new method that can handle all init-level operations, such as
persisting records to DynamoDB for the first time. This allows for
simplifying some of the methods in the ItemSubmission and ItemSubmissionDB
models.

How this addresses that need:
* Refactor ItemSubmissionDB.create as a conditional save method;
instances are created by calling the ItemSubmissionDB constructor
* Add save method to domain model, which uses ItemSubmissionDB.create
to perform conditional save
* Deprecate unused ItemSubmissionDB.get_or_create method

Side effects of this change:
* This encourages the following approach:
if writing new records, use ItemSubmission.create + ItemSubmission.save
if updating existing records, set attributes on ItemSubmission() + ItemSubmission.upsert_db

TLDR: Any upserts will update and overwrite an existing record in DynamoDB.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1455
@jonavellecuerdo jonavellecuerdo force-pushed the IN-1455-create-batch-for-base-workflow branch from adf4ae1 to f2d9e07 Compare September 16, 2025 17:29
@coveralls
Copy link

coveralls commented Sep 16, 2025

Pull Request Test Coverage Report for Build 18017142515

Details

  • 43 of 51 (84.31%) changed or added relevant lines in 5 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.5%) to 96.154%

Changes Missing Coverage Covered Lines Changed/Added Lines %
dsc/exceptions.py 7 8 87.5%
dsc/workflows/base/init.py 14 21 66.67%
Files with Coverage Reduction New Missed Lines %
dsc/item_submission.py 1 99.29%
Totals Coverage Status
Change from base Build 17497497241: -0.5%
Covered Lines: 1075
Relevant Lines: 1118

💛 - Coveralls

Comment on lines +293 to +315
item_submission = ItemSubmission.get(
batch_id=self.batch_id,
item_identifier=item_metadata["item_identifier"],
workflow_name=self.workflow_name,
source_system_identifier=item_metadata.get("source_system_identifier"),
)

# if no corresponding record in DynamoDB, skip
if not item_submission:
continue
Copy link
Contributor Author

@jonavellecuerdo jonavellecuerdo Sep 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, the Workflow.reconcile_items method is simply updated to use ItemSubmission.get, removing the option to create the record during the reconcile step. This is because a record is now added during the create-batch step, which assumes the responsibility of creating an item in the DynamoDB table for every item in a batch.

Note: If ItemSubmission.get returns None--i.e., a corresponding record is not found in DynamoDB--the item submission is skipped from the reconcile. This raises the question of how to report these cases, as reconcile_items currently only tracks metadata without bitstreams and vice versa. I propose resolving this as outside of the scope of this ticket and will likely be revisited as we work on the Wiley workflow and can take a closer look.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this @jonavellecuerdo. I'm appreciating the tension here given that reconcile was still using metadata to drive its iteration. And I think this may prompt another comment for me on that specific line.

Regardless of the outcome on the discussion, I support moving forward and maybe creating a ticket if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we switch to iteration from DynamoDB rather than the metadata, this check is unnecessary

@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review September 16, 2025 17:47
@jonavellecuerdo jonavellecuerdo requested a review from a team as a code owner September 16, 2025 17:47
Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good to me, nice work on scaffolding this update.

Requesting comment for some inline comments, and this question:

Do we need a new CLI command for create-batch?

Comment on lines +293 to +315
item_submission = ItemSubmission.get(
batch_id=self.batch_id,
item_identifier=item_metadata["item_identifier"],
workflow_name=self.workflow_name,
source_system_identifier=item_metadata.get("source_system_identifier"),
)

# if no corresponding record in DynamoDB, skip
if not item_submission:
continue
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this @jonavellecuerdo. I'm appreciating the tension here given that reconcile was still using metadata to drive its iteration. And I think this may prompt another comment for me on that specific line.

Regardless of the outcome on the discussion, I support moving forward and maybe creating a ticket if needed.

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! A few comments, mostly piggybacking off of @ghukill

Comment on lines +293 to +315
item_submission = ItemSubmission.get(
batch_id=self.batch_id,
item_identifier=item_metadata["item_identifier"],
workflow_name=self.workflow_name,
source_system_identifier=item_metadata.get("source_system_identifier"),
)

# if no corresponding record in DynamoDB, skip
if not item_submission:
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we switch to iteration from DynamoDB rather than the metadata, this check is unnecessary

jonavellecuerdo added a commit that referenced this pull request Sep 19, 2025
@jonavellecuerdo jonavellecuerdo force-pushed the IN-1455-create-batch-for-base-workflow branch from 9adb187 to 7f5a824 Compare September 19, 2025 17:59
@jonavellecuerdo
Copy link
Contributor Author

@ghukill @ehanson8

The latest commits were motivated by the discussion in #200 (comment). In order to support moving the item_metadata_iter into the batch creation step--specifically, during the writing of records to DynamoDB, the following updates were implemented:

NOTE: While I put in effort to update existing tests, I've opted not to add to the existing test suite. I am hoping to do some test cleanup as part of the later tickets as adding new tests to the existing suite has proven to be quite difficult. I do think the existing tests do offer some value but could be refactored to be clearer and to better support new tests for added functions.

@ghukill
Copy link

ghukill commented Sep 19, 2025

Thanks for the summary @jonavellecuerdo. Before I jump into a review, I have a concern:

If I'm recalling correctly, DynamoDB has fairly small limits on row size, I think like 400kb or something. While I'm betting most metadata would fit, it feels risky to store it there.

Admiteddly, I have not fully embarked on a review to understand how and why this change was needed, but I do see this in a commit message:

Side effects of this change: * In storing the source metadata as serialized JSON in DynamoDB, any subsequent methods can simply read the metadata from DynamoDB whereas previously, it would have to reach into the metadata file in S3 each time.

If the goal is just to avoid S3 reads, I'm not sure Dynamo is a good fit. If part of the goal is to normalize how different workflows loop through metadata, that feels a bit different. Happy to chat it out if helpful, sync or async, but I do worry that Dynamo isn't up for the task of storing item metadata.

@ehanson8
Copy link
Contributor

If the goal is just to avoid S3 reads, I'm not sure Dynamo is a good fit. If part of the goal is to normalize how different workflows loop through metadata, that feels a bit different. Happy to chat it out if helpful, sync or async, but I do worry that Dynamo isn't up for the task of storing item metadata.

I was surprised by this as well. I know I was saying to "minimize calls to item_metadata_iter()" but I was more thinking about driving reconcile with DynamoDB post-create_batch rather than moving source metadata and bitstreams functionality away from submit. I don't know that create_batch needs to be handling that, so I think this can be a simpler set of changes just related to reconcile

@ghukill
Copy link

ghukill commented Sep 19, 2025

If the goal is just to avoid S3 reads, I'm not sure Dynamo is a good fit. If part of the goal is to normalize how different workflows loop through metadata, that feels a bit different. Happy to chat it out if helpful, sync or async, but I do worry that Dynamo isn't up for the task of storing item metadata.

I was surprised by this as well. I know I was saying to "minimize calls to item_metadata_iter()" but I was more thinking about driving reconcile with DynamoDB post-create_batch rather than moving source metadata and bitstreams functionality away from submit. I don't know that create_batch needs to be handling that, so I think this can be a simpler set of changes just related to reconcile

This was my understanding as well.

If we assume it feels correct that the result of create_batch would be two things:

  1. a folder in S3 with all metadata and bitstreams
  2. all intellectual items from that batch are in Dynamo

Then it feels like reconcile, submit, and finalize can all use Dynamo to drive their loop.

What this complicates is when you have a "loose" file in the batch of which there is no metadata. How can we know, if we are driving reconciliation by intellectual items in Dynamo, that came from metadata in the first place?

My answer would be ignoring loose files. Perhaps when things settle down we could go back with a microscope and try to figure out how to bubble up that reconciliation error, but it continues to feel odd. I think we should let metadata records drive the whole thing, obviously noting when we can't find any bitstreams for them, but choose to ignore when there are "loose" files in the batch. Or heck, perhaps reconciliation could identify them by process of elimination? If we unite metadata records with bitstreams (passing reconciliation), but we have a file left over at the end, could we:

  1. Mark every row in Dynamo as reconciled=passed because all items with metadata have bitstreams
  2. Have the reconciliation job as a whole still fail, becaus of this error

@ehanson8
Copy link
Contributor

My answer would be ignoring loose files.

I agree, this is a minimal risk in the first place but we could have a ticket to address the issue later

@jonavellecuerdo jonavellecuerdo force-pushed the IN-1455-create-batch-for-base-workflow branch from 7f5a824 to eb1c57c Compare September 24, 2025 19:04
@jonavellecuerdo
Copy link
Contributor Author

@ghukill @ehanson8 I am requesting for another pass at review! As mentioned, I dropped the two commits that involved changes to the DynamoDB model. I've refactored the batch creation methods of Workflow.prepare_batch and Workflow._create_batch_in_db. While we discussed holding off on touching subclasses, I did find it helpful to go ahead and derive SimpleCSV.prepare_batch (it really wasn't a lot of code)!

Looking forward to getting your thoughts. :)

Copy link

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hearty approval! This is looking great. I know there is still wiring to be done for the CLI, and I'm betting implementing in all the workflows may put some pressure on it and potentially require little tweaks, but it feels like a good foundation for this batch creation.

And, potentially the removal of reconcile, broadly speaking.

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, a few comments!

jonavellecuerdo added a commit that referenced this pull request Sep 25, 2025
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good changes, let's just remove the _insufficient_item_metadata method and then I'll approve!

jonavellecuerdo added a commit that referenced this pull request Sep 25, 2025
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work!

Why these changes are being introduced:
* Support workflows that require programmatic batch creation.
It was decided that the batch creation step would assume
the responsibilities that were previously assigned to the
reconcile step, which is now up for deprecation.

How this addresses that need:
* Define shared private method for writing records to DynamoDB
* Add abstract method custom batch preparation processes
* Define SimpleCSV.prepare_batch
* Add unit tests for batch creation methods
* Remove record creation from Workflow.reconcile_items

Side effects of this change:
* Sets the stage for deprecation of 'reconcile_items'

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1455
@jonavellecuerdo jonavellecuerdo force-pushed the IN-1455-create-batch-for-base-workflow branch from 8a05e5c to eafb01e Compare September 25, 2025 18:31
@jonavellecuerdo jonavellecuerdo merged commit 7ba3fce into main Sep 25, 2025
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the IN-1455-create-batch-for-base-workflow branch September 25, 2025 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants