Add ArchivesSpace workflow #198
Conversation
Why these changes are being introduced:
* A new workflow is needed to handle ArchivesSpace deposits, which require an ingest report to be produced that will be used to update ArchivesSpace with the newly-created DSpace handles.

How this addresses that need:
* Add source_system_identifier attribute to ItemSubmissionDB class
* Add source_system_identifier attribute to ItemSubmission class and update get_or_create and create methods to include it, as well as corresponding unit tests
* Add ArchivesSpace class with new output_path property and a workflow_specific_processing method that creates an ingest report for updating ArchivesSpace, as well as corresponding unit tests
* Add ArchivesSpace to dsc/workflows/__init__.py
* Update Workflow.reconcile_items method to include source_system_identifier in ItemSubmission.create() call
* Add ArchivesSpace metadata mapping
* Add archivesspace_workflow_instance fixture

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1100
This is the mapping I’ve used for previous DDC uploads but I will be checking if they want to add more fields
@pytest.fixture
@freeze_time("2025-01-01 09:00:00")
Needed for the unit tests given the reliance on self.run_date
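For context, here's a minimal, self-contained illustration (not from this PR) of why freezing time matters for tests that depend on a run date; the run_date formatting is an assumption, not the actual dsc implementation:

```python
from datetime import datetime

from freezegun import freeze_time


@freeze_time("2025-01-01 09:00:00")
def test_run_date_is_deterministic():
    # freezegun pins "now", so any value derived from it (e.g. a run date
    # used in report filenames) is stable across test runs
    run_date = datetime.now().strftime("%Y-%m-%d")
    assert run_date == "2025-01-01"
```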
handle_uri_mapping[item_submission.source_system_identifier] = (
    item_submission.dspace_handle
    if item_submission.dspace_handle
    else "DSpace handle not set, possible error"
)
I’m not including the possibility that items were sent to a DSpace submission queue (which does not create a handle) because DDC has never used them
ghukill
left a comment
I'm unsure if I'm formally requesting any changes, but had a few comments and questions.
For the most part, I'm following along and it looks pretty good! I think how small the ArchivesSpace workflow class is speaks to the overall pattern of workflows extending "base" ones.
batch_id: str
item_identifier: str
workflow_name: str
source_system_identifier: str | None = None
Picking this line somewhat arbitrarily, but wondering if you could talk a bit more about the source_system_identifier?
I noticed this trickles all the way into DynamoDB.
FWIW, I'm a big fan of storing as many identifiers as possible, but wondering what about this workflow made this seem important now versus other workflows.
I do like the name though: think it's very clear, very direct.
If we need to provide a link from the source system (here, an archival object URI in ArchivesSpace) to the DSpace handle, we would use source_system_identifier. The ArchivesSpace workflow is the first with a defined use case for it. With the previous DDC deposits, I used this report to run a batch process that creates a new digital object with the DSpace handle attached to the specified archival object.
I kept the name generic since the Imaging Lab or other future stakeholders might have a source system where they need to track DSpace handles for content that was deposited.
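To illustrate that downstream use (which is outside the scope of this PR), here's a rough, hypothetical sketch of how a batch process could consume the ingest report; the base URL, repository id, report path, and payload details are assumptions, and the endpoints and record shapes should be verified against the ArchivesSpace API docs:

```python
import pandas as pd
import requests

ASPACE_API = "https://aspace.example.edu/api"  # assumed base URL
session = requests.post(
    f"{ASPACE_API}/users/admin/login", params={"password": "***"}
).json()["session"]
headers = {"X-ArchivesSpace-Session": session}

# columns match the ingest report written by the workflow: ao_uri, dspace_handle
report = pd.read_csv("ingest_report.csv")  # local copy of the report; path assumed

for row in report.itertuples():
    # create a digital object whose file version points at the DSpace handle
    digital_object = {
        "title": "Digitized content",  # placeholder title
        "digital_object_id": row.dspace_handle,
        "file_versions": [{"file_uri": row.dspace_handle}],
    }
    do_uri = requests.post(
        f"{ASPACE_API}/repositories/2/digital_objects",  # repository id assumed
        headers=headers,
        json=digital_object,
    ).json()["uri"]

    # attach the new digital object to the archival object as an instance
    archival_object = requests.get(f"{ASPACE_API}{row.ao_uri}", headers=headers).json()
    archival_object["instances"].append(
        {"instance_type": "digital_object", "digital_object": {"ref": do_uri}}
    )
    requests.post(f"{ASPACE_API}{row.ao_uri}", headers=headers, json=archival_object)
```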
I like it. Are we expecting that some/many workflows may not provide this?
If so, if some workflows don't have a way to provide this value, should we populate it with the item_identifier? or is that misleading, and leaving it blank would actually be ideal?
That was the reason for the default of None; it only needs to be supplied when it'll actually be used, as it is for ArchivesSpace.
@ehanson8 In the case of ArchivesSpace workflows, what is the item identifier they provide in the CSV file? 🤔 Is the "item identifier" == "source system identifier"?
I guess I'd have to defer entirely to @ehanson8 here! Feels like a wrinkle of the ASpace workflow.
Conceptually, I can appreciate, and think we should support, when there is a "source system" identifier that does not match the "DSC / working item identifier" that stakeholders would like to provide.
I'll admit the latter, the "DSC / working item identifier" has always been a little odd, given that it really is kind of only for the DSC/DSS space.
But I could see a universe where the "source" identifier is something odd / long / weird, has unique characters, who knows. But the batch we are given in DSC, maybe it's just an incrementing number or something.
I would also not see any issue if they sometimes were the same. Or, as @ehanson8 points out, that often workflows won't have a "source" identifier.
I mean, maybe we should think of the DSC item_identifier more like batch_item_identifier, which makes it clear it's an identifier that needs to be unique in the batch for a workflow, but that's kind of the end of its responsibility.
Noting that item_identifier can be repeated, right? Like in a secondary, later batch? That would effectively update the item in DSpace?
item_identifier would be 02-000458300 (naming convention coming out of the digitization process used for the bitstreams).
This is the important part: DSC can't find the bitstreams if we are not using this identifier. The source_system_identifier (the archival object URI) serves no purpose to DSC but is needed for updating ArchivesSpace with the ingest report.
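To make that concrete with hypothetical values: an item's item_identifier might be 02-000458300 (matching its bitstream filenames), while its source_system_identifier would be something like /repositories/2/archival_objects/12345, the URI of the archival object to update in ArchivesSpace.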
@ehanson8 Thank you for clarifying the earlier statement! I am now fully on board with a separate source_system_identifier column.
To answer @ghukill, I don't think DSS "updates" items in DSpace. For instance, if we ran submit on two separate occasions:
batch_id: aaa
item_identifier: 123
collection_handle: /handle/a
...
batch_id: bbb
item_identifier: 123
collection_handle: /handle/a
The DSpace collection would have two items associated with item_identifier: 123.
@jonavellecuerdo is correct ⬆️
* Create successful_item_submissions variable from list comprehension
* Replace csv module with pandas
Pushed a commit with the discussed updates
Pull Request Test Coverage Report for Build 17271907351
💛 - Coveralls
handle_uri_mapping = {}

# find item submissions that were successfully ingested on the current run
successful_item_submissions: list[ItemSubmission] = [
Thanks for list comprehension --> variable! Helps my reading of the code.
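As a side note for readers of this thread, here's a minimal, self-contained sketch of the filtering criteria described in the PR description (status of INGEST_SUCCESS and a run_date matching the current run); the ItemSubmission model below is a stand-in, not the actual dsc class:

```python
from dataclasses import dataclass


@dataclass
class ItemSubmission:  # stand-in for the real dsc model
    item_identifier: str
    status: str
    run_date: str
    dspace_handle: str | None = None


def filter_successful(
    item_submissions: list[ItemSubmission], run_date: str
) -> list[ItemSubmission]:
    """Keep only items ingested successfully during the current run."""
    return [
        item_submission
        for item_submission in item_submissions
        if item_submission.status == "INGEST_SUCCESS"
        and item_submission.run_date == run_date
    ]


items = [
    ItemSubmission("02-000458300", "INGEST_SUCCESS", "2025-01-01", "/handle/1721.1/1"),
    ItemSubmission("02-000458301", "INGEST_SUCCESS", "2024-12-01"),  # earlier run, skipped
]
print(filter_successful(items, "2025-01-01"))  # only the first item
```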
- writer.writerow(["ao_uri", "dspace_handle"])
- for ao_uri, dspace_handle in handle_uri_mapping.items():
-     writer.writerow([ao_uri, dspace_handle])
+ df.to_csv(csv_file, index=False)
Awesome! I forget that we can use smart_open to get a file-like object and then pass that to df.to_csv(). Very nice!
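For reference, a minimal sketch of that pattern (the bucket, key, and example mapping values are placeholders, not from this PR):

```python
import pandas as pd
from smart_open import open as smart_open  # smart_open handles s3:// URIs

handle_uri_mapping = {
    "/repositories/2/archival_objects/123": "http://hdl.handle.net/1721.1/1",  # hypothetical
}
df = pd.DataFrame(list(handle_uri_mapping.items()), columns=["ao_uri", "dspace_handle"])

# smart_open returns a file-like object that df.to_csv() can write to directly
with smart_open("s3://example-bucket/archivesspace/ingest_report.csv", "w") as csv_file:
    df.to_csv(csv_file, index=False)
```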
{
    "dc.title": {
        "source_field_name": "title",
        "language": "en_US"
Just noting: I continue to feel not confident about setting "language": "en_US" as we don't have any rules/conventions for setting it (which is why I opted to exclude the field from metadata_mapping/opencourseware.json).
You are correct; we should probably not use it. Good catch! I'll remove it before merging.
Actually, I will just remove it from dc.title and leave it on rights_statement and description since those are boilerplate strings from DDC, if that's OK @jonavellecuerdo
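Sketching what that adjusted mapping could look like; the exact DSpace field names and source field names for the rights statement and description entries are assumptions, not the actual dsc mapping:

```json
{
    "dc.title": {"source_field_name": "title"},
    "dc.rights": {"source_field_name": "rights_statement", "language": "en_US"},
    "dc.description": {"source_field_name": "description", "language": "en_US"}
}
```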
* Add ArchivesSpace workflow

Why these changes are being introduced:
* A new workflow is needed to handle ArchivesSpace deposits, which require an ingest report to be produced that will be used to update ArchivesSpace with the newly-created DSpace handles.

How this addresses that need:
* Add source_system_identifier attribute to ItemSubmissionDB class
* Add source_system_identifier attribute to ItemSubmission class and update get_or_create and create methods to include it, as well as corresponding unit tests
* Add ArchivesSpace class with new output_path property and a workflow_specific_processing method that creates an ingest report for updating ArchivesSpace, as well as corresponding unit tests
* Add ArchivesSpace to dsc/workflows/__init__.py
* Update Workflow.reconcile_items method to include source_system_identifier in ItemSubmission.create() call
* Add ArchivesSpace metadata mapping
* Add archivesspace_workflow_instance fixture

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1100

Updates based on discussion in PR #198:
* Create successful_item_submissions variable from list comprehension
* Replace csv module with pandas
* Remove language tag from dc.title in mapping
Purpose and background context
Add ArchivesSpace workflow and associated functionality.

I originally was thinking I needed a more invasive approach for generating the ingest report but realized this could all be accomplished with the DynamoDB data. To ensure DSC is only generating a report for items that were successfully ingested on the current run, I'm using the following criteria:

* status in DynamoDB is set to INGEST_SUCCESS
* run_date in DynamoDB matches self.run_date to ensure it was updated in the current run

The check on line 500 of dsc/workflows/base/__init__.py ensures anything already set to INGEST_SUCCESS will be skipped before the last_run_date is set on line 528. Please let me know if there are holes in this logic!

How can a reviewer manually see the effects of these changes?
Not possible at this stage
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/IN-1100