
Conversation


@ehanson8 ehanson8 commented Aug 25, 2025

Purpose and background context

Add ArchivesSpace workflow and associated functionality.

I originally thought I needed a more invasive approach for generating the ingest report, but realized this can all be accomplished with the DynamoDB data. To ensure DSC only generates a report for items that were successfully ingested on the current run, I'm using the following criteria:

  • The ingest report runs after all results messages are processed and all item submissions are updated in DynamoDB
  • status in DynamoDB is set to INGEST_SUCCESS
  • run_date in DynamoDB matches self.run_date to ensure it was updated in the current run

The check on line 500 of dsc/workflows/base/__init__.py ensures anything already set to INGEST_SUCCESS will be skipped before the last_run_date is set on line 528. Please let me know if there are holes in this logic!
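
For illustration, a minimal sketch of that filter over the DynamoDB records, using hypothetical variable names (the actual query helpers aren't shown in this PR excerpt):

# Hypothetical sketch of the ingest-report criteria described above.
# item_submissions is assumed to be the set of ItemSubmission records for
# this workflow's batch, read back from DynamoDB after results processing.
successful_item_submissions = [
    item_submission
    for item_submission in item_submissions
    if item_submission.status == "INGEST_SUCCESS"
    and item_submission.run_date == self.run_date  # updated on the current run
]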

How can a reviewer manually see the effects of these changes?

Not possible at this stage

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* A new workflow is needed to handle ArchivesSpace deposits, which require an ingest report to be produced that will be used to update ArchivesSpace with the newly-created DSpace handles.

How this addresses that need:
* Add source_system_identifier attribute to ItemSubmissionDB class
* Add source_system_identifier attribute to ItemSubmission class and update get_or_create and create methods to include it as well as corresponding unit tests
* Add ArchivesSpace class with new output_path property and a workflow_specific_processing method that creates an ingest report for updating ArchivesSpace as well as corresponding unit tests
* Add ArchivesSpace to dsc/workflows/__init__.py
* Update Workflow.reconcile_items method to include source_system_identifier in ItemSubmission.create() call
* Add ArchivesSpace metadata mapping
* Add archivesspace_workflow_instance fixture

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1100
Contributor Author

This is the mapping I’ve used for previous DDC uploads but I will be checking if they want to add more fields



@pytest.fixture
@freeze_time("2025-01-01 09:00:00")
Contributor Author

Needed for the unit tests, given the reliance on self.run_date.
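
For context, a minimal sketch of what such a fixture might look like; only the decorators come from the diff above, and the constructor arguments are assumptions:

import pytest
from freezegun import freeze_time

from dsc.workflows import ArchivesSpace


@pytest.fixture
@freeze_time("2025-01-01 09:00:00")
def archivesspace_workflow_instance():
    # Freezing time during fixture setup pins self.run_date, so assertions
    # against the ingest report's date filtering are deterministic.
    # Constructor arguments here are hypothetical.
    return ArchivesSpace(collection_handle="/handle/a", batch_id="aaa")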

handle_uri_mapping[item_submission.source_system_identifier] = (
    item_submission.dspace_handle
    if item_submission.dspace_handle
    else "DSpace handle not set, possible error"
)
Contributor Author

I’m not including the possibility that items were sent to a DSpace submission queue (which does not create a handle) because DDC has never used them

@ehanson8 ehanson8 marked this pull request as ready for review August 25, 2025 20:13
@ehanson8 ehanson8 requested a review from a team as a code owner August 25, 2025 20:13
@ghukill ghukill self-assigned this Aug 26, 2025

@ghukill ghukill left a comment


I'm unsure if I'm formally requesting any changes, but I had a few comments and questions.

For the most part, I'm following along and it looks pretty good! I think how small the ArchivesSpace workflow class is speaks to the overall pattern of workflows extending "base" ones.

batch_id: str
item_identifier: str
workflow_name: str
source_system_identifier: str | None = None
Reviewer

Picking this line somewhat arbitrarily, but wondering if you could talk a bit more about the source_system_identifier?

I noticed this trickles all the way into DynamoDB.

FWIW, I'm a big fan of storing as many identifiers as possible, but wondering what about this workflow made this seem important now versus other workflows.

I do like the name though: think it's very clear, very direct.

Contributor Author

If we need to provide a link from the source system (here, an archival object URI in ArchivesSpace) to the DSpace handle, we would use source_system_identifier. The ArchivesSpace workflow is the first with a defined use case for it. With the previous DDC deposits, I use this report to run a batch process that creates a new digital object with the DSpace handle attached to the specified archival object.

I kept the name generic since the Imaging Lab or other future stakeholders might have a source system where they need to track DSpace handles for content that was deposited.

Reviewer

I like it. Are we expecting that some/many workflows may not provide this?

If some workflows don't have a way to provide this value, should we populate it with the item_identifier? Or is that misleading, and leaving it blank would actually be ideal?

Contributor Author

That was the reason for the default of None: it only needs to be supplied when it will actually be used, as it is for ArchivesSpace.

Contributor

@ehanson8 In the case of ArchivesSpace workflows, what is the item identifier they provide in the CSV file? 🤔 Is the "item identifier" == "source system identifier"?

Reviewer

I guess I'd have to defer entirely to @ehanson8 here! Feels like a wrinkle of the ASpace workflow.

Conceptually, I can appreciate, and think we should support, when there is a "source system" identifier that does not match the "DSC / working item identifier" that stakeholders would like to provide.

I'll admit that the latter, the "DSC / working item identifier", has always been a little odd, given that it really is kind of only for the DSC/DSS space.

But I could see a universe where the "source" identifier is something odd / long / weird, has unique characters, who knows. But the batch we are given in DSC, maybe it's just an incrementing number or something.

I would also not see any issue if they sometimes were the same. Or, as @ehanson8 points out, that often workflows won't have a "source" identifier.

Reviewer

I mean, maybe we should think of the DSC item_identifier more like a batch_item_identifier, which makes it clear it's an identifier that needs to be unique in the batch, for a workflow, but that's kind of the end of its responsibility.

Noting that item_identifier can be repeated, right? Like in a secondary, later batch? That would effectively update the item in DSpace?

Contributor Author

item_identifier would be 02-000458300 (a naming convention coming out of the digitization process, used for the bitstreams).

This is the important part: DSC can't find the bitstreams if we are not using this identifier. The source_system_identifier (the archival object URI) serves no purpose to DSC but is needed for updating ArchivesSpace with the ingest report.

Contributor

@ehanson8 Thank you for clarifying the earlier statement! I am now fully on board with a separate source_system_identifier column.

To answer @ghukill, I don't think DSS "updates" items in DSpace. For instance, if we ran submit on two separate occasions:

batch_id: aaa
item_identifier: 123
collection_handle: /handle/a

...
batch_id: bbb
item_identifier: 123
collection_handle: /handle/a

The DSpace collection would have two items associated with item_identifier: 123.

Contributor Author

@jonavellecuerdo is correct ⬆️

* Create successful_item_submissions variable from list comprehension
* Replace csv module with pandas
@ehanson8
Contributor Author

Pushed a commit with the discussed updates


coveralls commented Aug 26, 2025

Pull Request Test Coverage Report for Build 17271907351

Details

  • 33 of 34 (97.06%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.006%) to 96.688%

Changes Missing Coverage:
  • dsc/workflows/archivesspace.py: 27 of 28 changed/added lines covered (96.43%)

Totals:
  • Change from base Build 17160792675: 0.006%
  • Covered Lines: 1051
  • Relevant Lines: 1087

💛 - Coveralls

@ghukill ghukill self-requested a review August 27, 2025 15:31
handle_uri_mapping = {}

# find item submissions that were successfully ingested on the current run
successful_item_submissions: list[ItemSubmission] = [
Reviewer

Thanks for list comprehension --> variable! Helps my reading of the code.

writer.writerow(["ao_uri", "dspace_handle"])
for ao_uri, dspace_handle in handle_uri_mapping.items():
    writer.writerow([ao_uri, dspace_handle])
df.to_csv(csv_file, index=False)
Reviewer

Awesome! I forget that we can use smart_open to get a file-like object and then pass that to df.to_csv(). Very nice!
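
For reference, a minimal sketch of that pattern; the S3 URI is a placeholder, and the column names come from the report header above:

import pandas as pd
from smart_open import open as smart_open

# Build the ingest report as a DataFrame and stream it to S3 via a
# file-like object from smart_open.
df = pd.DataFrame(
    list(handle_uri_mapping.items()), columns=["ao_uri", "dspace_handle"]
)
with smart_open("s3://example-bucket/ingest-report.csv", "w") as csv_file:
    df.to_csv(csv_file, index=False)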

{
    "dc.title": {
        "source_field_name": "title",
        "language": "en_US"
Contributor

@jonavellecuerdo jonavellecuerdo Aug 27, 2025

Just noting: I still don't feel confident about setting "language": "en_US", as we don't have any rules/conventions for setting it (which is why I opted to exclude the field from metadata_mapping/opencourseware.json).

Contributor Author

You are correct; we should probably not use it. Good catch! I'll remove it before merging.

Contributor Author

Actually, I will just remove it from dc.title and leave it on rights_statement and description since those are boilerplate strings from DDC, if that's OK @jonavellecuerdo
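
For illustration only, a sketch of the mapping shape after that change, rendered as a Python dict rather than the JSON file itself; the dc.title entry comes from the fragment above, while the second entry is hypothetical since the actual DSpace qualifier and source field names for the rights and description mappings aren't shown here:

metadata_mapping = {
    # language dropped from dc.title per the discussion above
    "dc.title": {
        "source_field_name": "title",
    },
    # hypothetical entry: language retained only on boilerplate DDC strings;
    # the real field names may differ
    "dc.description": {
        "source_field_name": "description",
        "language": "en_US",
    },
}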

@ehanson8 ehanson8 merged commit c84dd64 into main Aug 27, 2025
3 checks passed
@ehanson8 ehanson8 deleted the IN-1100-archivesspace-workflow branch August 27, 2025 15:59
ehanson8 added a commit that referenced this pull request Sep 2, 2025
* Add ArchivesSpace workflow

* Updates based on discussion in PR #198

* Create successful_item_submissions variable from list comprehension
* Replace csv module with pandas

* Remove language tag from dc.title in mapping