Update SCCS metadata mapping JSON file and support reruns #169
Purpose and background context
The initial goal of this PR was to update the SCCS metadata mapping JSON file to fulfill a batch deposit request for SCCS, which is encapsulated in the first commit 64da73e. The following screenshots and attachments show the results of the full DSC workflow run for this deposit.
As the third email shows, of the 106 items in the batch, 78 were successfully ingested by DSpace@MIT and 28 failed ingestion. These results served as the motivation for the other commits in this PR, which aim to address: how can we easily re-run DSC submissions on a subset of data (e.g., items that failed ingest)?
The additional commits are:
- Add --skip-items to the submit CLI command to skip item submissions using a list of item identifiers
  - Use ingested_items.csv to avoid re-submitting items that were already successfully ingested. If you rerun the submit command, passing a comma-delimited string of these item identifiers to --skip-items will effectively result in DSC only submitting items that do not show up in the list.
  - FinalizeReportWorkflow.get_bitstream_s3_uris should capture any files in the batch folder that include the "item identifier" (for the item) in the object prefix. Unless we know all the files for every deposit for a given workflow will strictly follow a format and/or data type, we generally want these functions to be "greedy" when finding files. Thus, if the DSpace metadata JSON files (which always include the item identifier in the object prefix) from a previous run are still in the bucket, these will be included as bitstreams for the item. Thinking about this question led to the addition of one more commit (see next bullet)!
- Update SimpleCSV to exclude DSpace metadata JSON files when retrieving bitstreams
  - Running submit without --skip-items will result in all metadata JSON files being overwritten.
How can a reviewer manually see the effects of these changes?
This assumes you have two terminals open with AWS Dev credentials loaded in; set current working directories to DSC and DSS repos, respectively!
A. Updated SCCS metadata mapping JSON
To demonstrate this, I ran the full DSC workflow for the batch deposit requested via INFRA-509.
- DSC Reconcile Results - sccs, batch='sccs-batch-2025-02-09-mit-annual-reports' (sent at 3:07PM)
- DSC Submission Results - sccs, batch='sccs-batch-2025-02-09-mit-annual-reports' (sent at 3:10PM)
- DSpace Submission Results - sccs, batch='sccs-batch-2025-02-09-mit-annual-reports' (sent at 3:18PM)
Note: One item failed ingest, but resolving this is outside the scope of this PR.
B. Skip items when submitting
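For reviewers who want a concrete picture of the skip behavior, here is a minimal sketch of building the comma-delimited --skip-items value from a CSV of ingested items and filtering a batch against it. The column name `item_identifier`, the sample identifiers, and both helper functions are assumptions made for illustration; they are not taken from the DSC codebase.

```python
import csv
import io


def build_skip_items_arg(csv_text: str, column: str = "item_identifier") -> str:
    """Return a comma-delimited string of item identifiers found in CSV text.

    The column name is a hypothetical default, not the real report schema.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return ",".join(row[column].strip() for row in reader if row.get(column))


def items_to_submit(batch_items: list[str], skip_items_arg: str) -> list[str]:
    """Keep only the batch items that do not appear in the skip list."""
    skip = {value.strip() for value in skip_items_arg.split(",") if value.strip()}
    return [item for item in batch_items if item not in skip]


# Hypothetical stand-in for the contents of ingested_items.csv.
sample_csv = "item_identifier\nitem-001\nitem-002\n"

skip_arg = build_skip_items_arg(sample_csv)
remaining = items_to_submit(["item-001", "item-002", "item-003"], skip_arg)

print(skip_arg)
print(remaining)
```

The string held in `skip_arg` is the kind of value that would be passed to --skip-items; only identifiers absent from it remain eligible for submission.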
To demonstrate this, I ran the full DSC workflow twice using sample files for OpenCourseWare deposits:
1st run: All 5 items successfully ingested by DSpace
- submit report: DSC Submission Results - opencourseware, batch='ocw-batch-2025-02-10' (sent at 1:14PM)
- finalize report: DSpace Submission Results - opencourseware, batch='ocw-batch-2025-02-10'
2nd run: A single item is successfully re-ingested by DSpace (results in duplicate item in DSpace) (via --skip-items "<item-identifiers-for-4-ingested-items>")
- ingested_items.csv from the finalize report (I just copy this file into the root of the dspace-submission-composer repo). Retrieve item identifiers from the CSV. Run pipenv run python and execute the following commands:
- submit CLI command:
- submit report: DSC Submission Results - opencourseware, batch='ocw-batch-2025-02-10' (sent at 1:32PM)
- finalize report: DSpace Submission Results - opencourseware, batch='ocw-batch-2025-02-10' (sent at 1:36PM)
C. Remove DSpace metadata JSON files
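The interaction between "greedy" bitstream matching and leftover metadata JSON files can be sketched as follows. This is an illustration only: the function name, the `_metadata.json` suffix, and the sample object keys are assumptions, and the real logic in SimpleCSV may differ.

```python
def find_bitstream_keys(
    object_keys: list[str],
    item_identifier: str,
    metadata_suffix: str = "_metadata.json",  # hypothetical naming convention
) -> list[str]:
    """Greedily match keys containing the item identifier, minus metadata JSON."""
    return [
        key
        for key in object_keys
        if item_identifier in key and not key.endswith(metadata_suffix)
    ]


# Hypothetical batch folder listing, including a metadata file from an earlier run.
keys = [
    "batch/123_001.pdf",
    "batch/123_002.txt",
    "batch/123_metadata.json",
    "batch/456_001.pdf",
]

bitstreams = find_bitstream_keys(keys, "123")
print(bitstreams)
```

Without the suffix check, the leftover `batch/123_metadata.json` would be returned as a bitstream for item 123, which is the situation the exclusion is meant to prevent.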
--execute:
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES | NO
What are the relevant tickets?
Developer
Code Reviewer(s)