Add ArchivesSpace workflow #198
Conversation
Why these changes are being introduced:
* A new workflow is needed to handle ArchivesSpace deposits, which require an ingest report to be produced that will be used to update ArchivesSpace with the newly-created DSpace handles.

How this addresses that need:
* Add source_system_identifier attribute to ItemSubmissionDB class
* Add source_system_identifier attribute to ItemSubmission class and update get_or_create and create methods to include it, as well as corresponding unit tests
* Add ArchivesSpace class with new output_path property and a workflow_specific_processing method that creates an ingest report for updating ArchivesSpace, as well as corresponding unit tests
* Add ArchivesSpace to dsc/workflows/__init__.py
* Update Workflow.reconcile_items method to include source_system_identifier in ItemSubmission.create() call
* Add ArchivesSpace metadata mapping
* Add archivesspace_workflow_instance fixture

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1100
This is the mapping I’ve used for previous DDC uploads but I will be checking if they want to add more fields
@pytest.fixture
@freeze_time("2025-01-01 09:00:00")
Needed for the unit tests given the reliance on self.run_date
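For context, here's a minimal, self-contained illustration (not from this PR) of why freezing time matters for tests that depend on a run date; the run_date formatting is an assumption, not the actual dsc implementation:

```python
from datetime import datetime

from freezegun import freeze_time


@freeze_time("2025-01-01 09:00:00")
def test_run_date_is_deterministic():
    # freezegun pins "now", so any value derived from it (e.g. a run date
    # used in report filenames) is stable across test runs
    run_date = datetime.now().strftime("%Y-%m-%d")
    assert run_date == "2025-01-01"
```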
handle_uri_mapping[item_submission.source_system_identifier] = (
    item_submission.dspace_handle
    if item_submission.dspace_handle
    else "DSpace handle not set, possible error"
)
I’m not including the possibility that items were sent to a DSpace submission queue (which does not create a handle) because DDC has never used them
ghukill
left a comment
I'm unsure if I'm formally requesting any changes, but had a few comments and questions.
For the most part, I'm following along and it looks pretty good! I think how small the ArchivesSpace workflow class is speaks to the overall pattern of workflows extending "base" ones.
batch_id: str
item_identifier: str
workflow_name: str
source_system_identifier: str | None = None
Picking this line somewhat arbitrarily, but wondering if you could talk a bit more about the source_system_identifier?
I noticed this trickles all the way into DynamoDB.
FWIW, I'm a big fan of storing as many identifiers as possible, but wondering what about this workflow made this seem important now versus other workflows.
I do like the name though: think it's very clear, very direct.
If we need to provide a link from the source system (here, an archival object URI in ArchivesSpace) to the DSpace handle, we would use source_system_identifier. The ArchivesSpace workflow is the first with a defined use case for it. With the previous DDC deposits, I used this report to run a batch process that creates a new digital object with the DSpace handle attached to the specified archival object.
I kept the name generic since the Imaging Lab or other future stakeholders might have a source system where they need to track DSpace handles for content that was deposited.
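To illustrate that downstream use (which is outside the scope of this PR), here's a rough, hypothetical sketch of how a batch process could consume the ingest report; the base URL, repository id, report path, and payload details are assumptions, and the endpoints and record shapes should be verified against the ArchivesSpace API docs:

```python
import pandas as pd
import requests

ASPACE_API = "https://aspace.example.edu/api"  # assumed base URL
session = requests.post(
    f"{ASPACE_API}/users/admin/login", params={"password": "***"}
).json()["session"]
headers = {"X-ArchivesSpace-Session": session}

# columns match the ingest report written by the workflow: ao_uri, dspace_handle
report = pd.read_csv("ingest_report.csv")  # local copy of the report; path assumed

for row in report.itertuples():
    # create a digital object whose file version points at the DSpace handle
    digital_object = {
        "title": "Digitized content",  # placeholder title
        "digital_object_id": row.dspace_handle,
        "file_versions": [{"file_uri": row.dspace_handle}],
    }
    do_uri = requests.post(
        f"{ASPACE_API}/repositories/2/digital_objects",  # repository id assumed
        headers=headers,
        json=digital_object,
    ).json()["uri"]

    # attach the new digital object to the archival object as an instance
    archival_object = requests.get(f"{ASPACE_API}{row.ao_uri}", headers=headers).json()
    archival_object["instances"].append(
        {"instance_type": "digital_object", "digital_object": {"ref": do_uri}}
    )
    requests.post(f"{ASPACE_API}{row.ao_uri}", headers=headers, json=archival_object)
```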
I like it. Are we expecting that some/many workflows may not provide this?
If so, if some workflows don't have a way to provide this value, should we populate it with the item_identifier? or is that misleading, and leaving it blank would actually be ideal?
That was the reason for the default of None; it only needs to be supplied when it'll actually be used, as it is for ArchivesSpace.
@ehanson8 In the case of ArchivesSpace workflows, what is the item identifier they provide in the CSV file? 🤔 Is the "item identifier" == "source system identifier"?
I guess I'd have to defer entirely to @ehanson8 here! Feels like a wrinkle of the ASpace workflow.
Conceptually, I can appreciate, and think we should support, when there is a "source system" identifier that does not match the "DSC / working item identifier" that stakeholders would like to provide.
I'll admit the latter, the "DSC / working item identifier" has always been a little odd, given that it really is kind of only for the DSC/DSS space.
But I could see a universe where the "source" identifier is something odd / long / weird, has unique characters, who knows. But the batch we are given in DSC, maybe it's just an incrementing number or something.
I would also not see any issue if they sometimes were the same. Or, as @ehanson8 points out, that often workflows won't have a "source" identifier.
I mean, maybe we should think of the DSC item_identifier more like batch_item_identifier, which makes it clear it's an identifier that needs to be unique in the batch for a workflow, but that's kind of the end of its responsibility.
Noting that item_identifier can be repeated, right? Like in a secondary, later batch? That would effectively update the item in DSpace?
item_identifier would be 02-000458300 (naming convention coming out of the digitization process used for the bitstreams).
This is the important part: DSC can't find the bitstreams if we are not using this identifier. The source_system_identifier (the archival object URI) serves no purpose to DSC but is needed for updating ArchivesSpace with the ingest report.
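To make that concrete with hypothetical values: an item's item_identifier might be 02-000458300 (matching its bitstream filenames), while its source_system_identifier would be something like /repositories/2/archival_objects/12345, the URI of the archival object to update in ArchivesSpace.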
@ehanson8 Thank you for clarifying the earlier statement! I am now fully on board with a separate source_system_identifier column.
To answer @ghukill, I don't think DSS "updates" items in DSpace. For instance, if we ran submit on two separate occasions:
batch_id: aaa
item_identifier: 123
collection_handle: /handle/a
...
batch_id: bbb
item_identifier: 123
collection_handle: /handle/a
The DSpace collection would have two items associated with item_identifier: 123.
@jonavellecuerdo is correct ⬆️
* Create successful_item_submissions variable from list comprehension
* Replace csv module with pandas
Pushed a commit with the discussed updates
Pull Request Test Coverage Report for Build 17271907351
💛 - Coveralls
handle_uri_mapping = {}

# find item submissions that were successfully ingested on the current run
successful_item_submissions: list[ItemSubmission] = [
Thanks for list comprehension --> variable! Helps my reading of the code.
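As a side note for readers of this thread, here's a minimal, self-contained sketch of the filtering criteria described in the PR description (status of INGEST_SUCCESS and a run_date matching the current run); the ItemSubmission model below is a stand-in, not the actual dsc class:

```python
from dataclasses import dataclass


@dataclass
class ItemSubmission:  # stand-in for the real dsc model
    item_identifier: str
    status: str
    run_date: str
    dspace_handle: str | None = None


def filter_successful(
    item_submissions: list[ItemSubmission], run_date: str
) -> list[ItemSubmission]:
    """Keep only items ingested successfully during the current run."""
    return [
        item_submission
        for item_submission in item_submissions
        if item_submission.status == "INGEST_SUCCESS"
        and item_submission.run_date == run_date
    ]


items = [
    ItemSubmission("02-000458300", "INGEST_SUCCESS", "2025-01-01", "/handle/1721.1/1"),
    ItemSubmission("02-000458301", "INGEST_SUCCESS", "2024-12-01"),  # earlier run, skipped
]
print(filter_successful(items, "2025-01-01"))  # only the first item
```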
- writer.writerow(["ao_uri", "dspace_handle"])
- for ao_uri, dspace_handle in handle_uri_mapping.items():
-     writer.writerow([ao_uri, dspace_handle])
+ df.to_csv(csv_file, index=False)
Awesome! I forget that we can use smart_open to get a file-like object and then pass that to df.to_csv(). Very nice!
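For reference, a minimal sketch of that pattern (the bucket, key, and example mapping values are placeholders, not from this PR):

```python
import pandas as pd
from smart_open import open as smart_open  # smart_open handles s3:// URIs

handle_uri_mapping = {
    "/repositories/2/archival_objects/123": "http://hdl.handle.net/1721.1/1",  # hypothetical
}
df = pd.DataFrame(list(handle_uri_mapping.items()), columns=["ao_uri", "dspace_handle"])

# smart_open returns a file-like object that df.to_csv() can write to directly
with smart_open("s3://example-bucket/archivesspace/ingest_report.csv", "w") as csv_file:
    df.to_csv(csv_file, index=False)
```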
{
    "dc.title": {
        "source_field_name": "title",
        "language": "en_US"
Just noting: I continue to feel not confident about setting "language": "en_US" as we don't have any rules/conventions for setting it (which is why I opted to exclude the field from metadata_mapping/opencourseware.json).
You are correct; we should probably not use it. Good catch! I'll remove it before merging.
Actually, I will just remove it from dc.title and leave it on rights_statement and description since those are boilerplate strings from DDC, if that's OK @jonavellecuerdo
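Sketching what that adjusted mapping could look like; the exact DSpace field names and source field names for the rights statement and description entries are assumptions, not the actual dsc mapping:

```json
{
    "dc.title": {"source_field_name": "title"},
    "dc.rights": {"source_field_name": "rights_statement", "language": "en_US"},
    "dc.description": {"source_field_name": "description", "language": "en_US"}
}
```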
* Add ArchivesSpace workflow

Why these changes are being introduced:
* A new workflow is needed to handle ArchivesSpace deposits, which require an ingest report to be produced that will be used to update ArchivesSpace with the newly-created DSpace handles.

How this addresses that need:
* Add source_system_identifier attribute to ItemSubmissionDB class
* Add source_system_identifier attribute to ItemSubmission class and update get_or_create and create methods to include it, as well as corresponding unit tests
* Add ArchivesSpace class with new output_path property and a workflow_specific_processing method that creates an ingest report for updating ArchivesSpace, as well as corresponding unit tests
* Add ArchivesSpace to dsc/workflows/__init__.py
* Update Workflow.reconcile_items method to include source_system_identifier in ItemSubmission.create() call
* Add ArchivesSpace metadata mapping
* Add archivesspace_workflow_instance fixture

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1100

Updates based on discussion in PR #198:
* Create successful_item_submissions variable from list comprehension
* Replace csv module with pandas
* Remove language tag from dc.title in mapping
Purpose and background context
Add ArchivesSpace workflow and associated functionality.

I originally was thinking I needed a more invasive approach for generating the ingest report but realized this could all be accomplished with the DynamoDB data. To ensure DSC is only generating a report for items that were successfully ingested on the current run, I'm using the following criteria:

* status in DynamoDB is set to INGEST_SUCCESS
* run_date in DynamoDB matches self.run_date to ensure it was updated in the current run

The check on line 500 of dsc/workflows/base/__init__.py ensures anything already set to INGEST_SUCCESS will be skipped before the last_run_date is set on line 528. Please let me know if there are holes in this logic!

How can a reviewer manually see the effects of these changes?
Not possible at this stage
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/IN-1100