Conversation

@jonavellecuerdo (Contributor) commented Mar 4, 2025

Purpose and background context

The initial goal of this PR was to update the SCCS metadata mapping JSON file to fulfill a batch deposit request for SCCS, which is encapsulated in the first commit 64da73e. The following screenshots and attachments show the results of the full DSC workflow run for this deposit.

  • DSC Reconcile Results - sccs, batch='sccs-batch-2025-02-09-mit-annual-reports' (sent at 9:41AM)
  • DSC Submission Results - sccs, batch='sccs-batch-2025-02-09-mit-annual-reports' (sent at 9:42AM)
  • DSpace Submission Results - sccs, batch='sccs-batch-2025-02-09-mit-annual-reports'

As the third email shows, of the 106 items in the batch, 78 were successfully ingested by DSpace@MIT and 28 failed ingestion. These results motivated the other commits in this PR, which address the question: how can we easily re-run DSC submissions on a subset of data (e.g., items that failed ingest)?

The additional commits are:

  • Add a --skip-items option to the submit CLI command to skip item submissions for a list of item identifiers
    • How is this useful: Retrieve the list of item identifiers from ingested_items.csv to avoid re-submitting items that were already successfully ingested. If you rerun the submit command and pass a comma-delimited string of these item identifiers to --skip-items, DSC will only submit items that do not appear in the list (see the sketch after this list).
  • Fix counts in FinalizeReport
    • How is this useful: Corrects the error in the third screenshot shared above. The email should have clearly indicated that 106 messages were processed, 78 items were ingested, and 28 items encountered errors.
  • Create a CLI command to bulk-delete metadata JSON files.
    • How is this useful: One of the steps to "reset" a deposit is to remove the DSpace metadata JSON files generated by DSC from the batch folder in S3. This avoids including the DSpace metadata JSON files as bitstreams (in the case of reruns).
      • Why would the DSpace metadata JSON files get included as bitstreams?: For every item in a batch, Workflow.get_bitstream_s3_uris should capture any files in the batch folder that include the item's identifier in the object prefix. Unless we know all the files for every deposit for a given workflow will strictly follow a format and/or data type, we generally want these functions to be "greedy" when finding files. Thus, if the DSpace metadata JSON files -- which always include the item identifier in the object prefix -- from a previous run are still in the bucket, they will be included as bitstreams for the item. Thinking about this question led to the addition of one more commit (see next bullet)!
  • Update SimpleCSV to exclude DSpace metadata JSON files when retrieving bitstreams
    • How is this useful: An additional safeguard to ensure these metadata JSON files are not included as bitstreams in the event of reruns. This means that using the CLI command to bulk-delete metadata JSON files is no longer required (but is available if needed), as rerunning will simply overwrite the metadata JSON files in the batch folder in S3 (also covered in the sketch after this list).
      • Note: Running submit without --skip-items will result in all metadata JSON files being overwritten.
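
To make these two safeguards concrete, here is a minimal sketch of how the --skip-items filter and the bitstream exclusion could look. The function names mirror those referenced in this PR (item_submissions_iter and Workflow.get_bitstream_s3_uris), but the bodies, signatures, and parameter names are illustrative assumptions, not the actual DSC implementation:

import logging

import boto3

logger = logging.getLogger(__name__)

def item_submissions_iter(batch_items, skip_items=None):
    # Hypothetical filter: skip_items is assumed to be the comma-delimited
    # string passed to --skip-items; matching items are never submitted.
    skip = set(skip_items.split(",")) if skip_items else set()
    for item in batch_items:
        if item["item_identifier"] in skip:
            logger.info("Skipping submission for item: %s", item["item_identifier"])
            continue
        yield item

def get_bitstream_s3_uris(bucket, batch_prefix, item_identifier):
    # Greedy match: any object in the batch folder whose key contains the
    # item identifier counts as a bitstream, except DSC's own
    # <item-identifier>_metadata.json files, which are excluded.
    s3_client = boto3.client("s3")
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=batch_prefix)
    return [
        f"s3://{bucket}/{obj['Key']}"
        for obj in response.get("Contents", [])
        if item_identifier in obj["Key"] and not obj["Key"].endswith("_metadata.json")
    ]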

How can a reviewer manually see the effects of these changes?

This assumes you have two terminals open with AWS Dev credentials loaded; set the current working directories to the DSC and DSS repos, respectively!

A. Updated SCCS metadata mapping JSON
To demonstrate this, I ran the full DSC workflow for the batch deposit requested via INFRA-509.

  • Items have been deposited to the following collection: https://mit-test.atmire.com/handle/1721.1/154307/recent-submissions.
  • The following emails were relevant to this run:
    • DSC Reconcile Results - sccs, batch='sccs-batch-2025-02-09-mit-annual-reports' (sent at 3:07PM)
      Reconcile results summary for sccs deposit
      Batch: sccs-batch-2025-02-09-mit-annual-reports
      Run date: 2025-03-05 20:06:23
      
      Results:
      Reconciled: 106
      
    • DSC Submission Results - sccs, batch='sccs-batch-2025-02-09-mit-annual-reports' (sent at 3:10PM)
      Submission results for sccs deposit
      Batch: sccs-batch-2025-02-09-mit-annual-reports
      Run date: 2025-03-05 20:09:34
      
      Results:
      Messages successfully sent to DSS: 106
      Errors: 0
      
    • DSpace Submission Results - sccs, batch='sccs-batch-2025-02-09-mit-annual-reports' (sent at 3:18PM)
      Ingest results for sccs deposit
      Batch: sccs-batch-2025-02-09-mit-annual-reports
      Run date: 2025-03-05 20:17:48
      
      Results:
      Processed: 106
      Ingested: 105
      Errors: 1
      
    Note: One item failed ingest, but resolving this is outside the scope of this PR.

B. Skip items when submitting
To demonstrate this, I ran the full DSC workflow twice using sample files for OpenCourseWare deposits:

  • 1st run: All 5 items successfully ingested by DSpace

    • See submit report: DSC Submission Results - opencourseware, batch='ocw-batch-2025-02-10' (sent at 1:14PM)
      Submission results for opencourseware deposit
      Batch: ocw-batch-2025-02-10
      Run date: 2025-03-05 18:13:26
      
      Results:
      Messages successfully sent to DSS: 5
      Errors: 0
      
    • See finalize report: DSpace Submission Results - opencourseware, batch='ocw-batch-2025-02-10'
      Ingest results for opencourseware deposit
      Batch: ocw-batch-2025-02-10
      Run date: 2025-03-05 18:20:08
      
      Results:
      Processed: 5
      Ingested: 5
      Errors: 0
      
  • 2nd run: A single item is successfully re-ingested by DSpace, resulting in a duplicate item in DSpace (via --skip-items "<item-identifiers-for-4-ingested-items>")

    • Download ingested_items.csv from the finalize report (I just copy this file into the root of the dspace-submission-composer repo). Retrieve the item identifiers from the CSV: run pipenv run python and execute the following commands:
      import pandas as pd

      # read the finalize report's CSV and build a comma-delimited string of
      # the first four ingested item identifiers, leaving one item to re-submit
      df = pd.read_csv("ingested_items.csv")
      ",".join(df["item_identifier"].to_list()[:4])
      Copy the generated string.
    • Rerun the submit CLI command:
      pipenv run dsc --workflow-name="opencourseware" --batch-id=ocw-batch-2025-02-10 submit -c /1721.1/154279 --skip-items '21g.321-spring-2013,21g.346-spring-2014,wgs.301j-fall-2014,14.02-fall-2004'
      You should see the following logs:
2025-03-05 13:31:00,446 INFO dsc.cli.main(): Logger 'root' configured with level=INFO
2025-03-05 13:31:00,446 INFO dsc.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-03-05 13:31:00,446 INFO dsc.cli.main(): Running process
2025-03-05 13:31:00,447 INFO dsc.workflows.base.submit_items(): Submitting messages to the DSS input queue 'dss-input-dev' for batch 'ocw-batch-2025-02-10'
2025-03-05 13:31:01,940 INFO dsc.workflows.base.item_submissions_iter(): Skipping submission for item: 14.02-fall-2004
2025-03-05 13:31:02,955 INFO dsc.workflows.base.item_submissions_iter(): Preparing submission for item: 14.02-spring-2014
2025-03-05 13:31:03,474 INFO dsc.item_submission.upload_dspace_metadata(): Metadata uploaded to S3: s3://wiley-files-dev-222053980223/opencourseware/ocw-batch-2025-02-10/14.02-spring-2014_metadata.json
2025-03-05 13:31:03,759 INFO dsc.workflows.base.submit_items(): Sent item submission message: 5b0e5c0b-b7a3-4642-9f54-b8dc603a8817
2025-03-05 13:31:04,763 INFO dsc.workflows.base.item_submissions_iter(): Skipping submission for item: 21g.321-spring-2013
2025-03-05 13:31:05,707 INFO dsc.workflows.base.item_submissions_iter(): Skipping submission for item: 21g.346-spring-2014
2025-03-05 13:31:06,782 INFO dsc.workflows.base.item_submissions_iter(): Skipping submission for item: wgs.301j-fall-2014
2025-03-05 13:31:06,785 INFO dsc.workflows.base.submit_items(): Submitted messages to the DSS input queue 'dss-input-dev' for batch 'ocw-batch-2025-02-10': {"total": 1, "submitted": 1, "errors": 0}
2025-03-05 13:31:07,103 INFO dsc.cli.post_main_group_subcommand(): Application exiting
2025-03-05 13:31:07,103 INFO dsc.cli.post_main_group_subcommand(): Total time elapsed: 0:00:06.658427
  • See submit report: DSC Submission Results - opencourseware, batch='ocw-batch-2025-02-10' (sent at 1:32PM)
    Submission results for opencourseware deposit
    Batch: ocw-batch-2025-02-10
    Run date: 2025-03-05 18:31:06
    
    Results:
    Messages successfully sent to DSS: 1
    Errors: 0
    
  • Run DSS in Dev.
  • See finalize report: DSpace Submission Results - opencourseware, batch='ocw-batch-2025-02-10' (sent at 1:36PM)
    Ingest results for opencourseware deposit
    Batch: ocw-batch-2025-02-10
    Run date: 2025-03-05 18:36:21
    
    Results:
    Processed: 1 # <---- fixed counts!
    Ingested: 1 # <---- fixed counts!
    Errors: 0
    
  • If you take a look at the collection in Dev DSpace@MIT, it should show 6 items (or 6 + n, where n = the number of times you execute the steps for the 2nd run 😉 ).

C. Remove DSpace metadata JSON files

  1. Run the command:
    pipenv run dsc --workflow-name="opencourseware" --batch-id=ocw-batch-2025-03-05 remove-dspace-metadata --items "123.pdf,124.pdf"
    You should see the following output:
2025-03-05 13:49:49,115 INFO dsc.cli.main(): Logger 'root' configured with level=INFO
2025-03-05 13:49:49,115 INFO dsc.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-03-05 13:49:49,115 INFO dsc.cli.main(): Running process
2025-03-05 13:49:49,189 INFO dsc.workflows.base.remove_dspace_metadata(): Searching for DSpace metadata JSON file(s) to delete
2025-03-05 13:49:49,333 INFO dsc.workflows.base.remove_dspace_metadata(): Found 2 DSpace metadata JSON file(s) to delete: ['opencourseware/ocw-batch-2025-03-05/123.pdf_metadata.json', 'opencourseware/ocw-batch-2025-03-05/124.pdf_metadata.json']
2025-03-05 13:49:49,334 INFO dsc.cli.post_main_group_subcommand(): Application exiting
2025-03-05 13:49:49,334 INFO dsc.cli.post_main_group_subcommand(): Total time elapsed: 0:00:00.219661
  2. Run the command with --execute:
    pipenv run dsc --workflow-name="opencourseware" --batch-id=ocw-batch-2025-03-05 remove-dspace-metadata --items "123.pdf" --execute
    
    You should see the following output:
2025-03-05 13:52:00,464 INFO dsc.cli.main(): Logger 'root' configured with level=INFO
2025-03-05 13:52:00,464 INFO dsc.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-03-05 13:52:00,464 INFO dsc.cli.main(): Running process
2025-03-05 13:52:00,566 INFO dsc.workflows.base.remove_dspace_metadata(): Searching for DSpace metadata JSON file(s) to delete
2025-03-05 13:52:00,763 INFO dsc.workflows.base.remove_dspace_metadata(): Found 1 DSpace metadata JSON file(s) to delete: ['opencourseware/ocw-batch-2025-03-05/123.pdf_metadata.json']
2025-03-05 13:52:00,763 WARNING dsc.workflows.base.remove_dspace_metadata(): Deleting 1 DSpace metadata JSON file(s)
2025-03-05 13:52:00,817 INFO dsc.cli.post_main_group_subcommand(): Application exiting
2025-03-05 13:52:00,817 INFO dsc.cli.post_main_group_subcommand(): Total time elapsed: 0:00:00.353228
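
For reference, the dry-run-by-default pattern shown above could be implemented roughly as follows. This is a minimal sketch assuming boto3 and hypothetical parameter names; it is not the actual remove_dspace_metadata implementation:

import logging

import boto3

logger = logging.getLogger(__name__)

def remove_dspace_metadata(bucket, batch_prefix, item_identifiers, execute=False):
    # Find the <item-identifier>_metadata.json files in the batch folder
    s3_client = boto3.client("s3")
    logger.info("Searching for DSpace metadata JSON file(s) to delete")
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=batch_prefix)
    keys = [
        obj["Key"]
        for obj in response.get("Contents", [])
        if any(
            obj["Key"].endswith(f"{item_id}_metadata.json")
            for item_id in item_identifiers
        )
    ]
    logger.info("Found %i DSpace metadata JSON file(s) to delete: %s", len(keys), keys)
    if not execute:  # dry run by default: report findings, delete nothing
        return
    logger.warning("Deleting %i DSpace metadata JSON file(s)", len(keys))
    for key in keys:
        s3_client.delete_object(Bucket=bucket, Key=key)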

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES | NO

What are the relevant tickets?

  • INFRA-509

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@coveralls commented Mar 5, 2025

Pull Request Test Coverage Report for Build 13707023220

Details

  • 44 of 48 (91.67%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.2%) to 95.092%

Changes Missing Coverage:

  File                             Covered Lines   Changed/Added Lines   %
  dsc/workflows/base/__init__.py   30              34                    88.24%

Totals (Coverage Status):

  Change from base Build 13706963730: -0.2%
  Covered Lines: 775
  Relevant Lines: 815

💛 - Coveralls

@jonavellecuerdo force-pushed the INFRA-509-sccs-deposit-request branch 2 times, most recently from f0f9860 to a94f511 on March 5, 2025 at 20:59
@jonavellecuerdo mentioned this pull request Mar 6, 2025
Base automatically changed from IN-1174-create-submit-report to main March 6, 2025 19:50
@jonavellecuerdo changed the title from "Update SCCS metadata mapping JSON file" to "Update SCCS metadata mapping JSON file and support reruns" Mar 6, 2025
Why these changes are being introduced:
* The metadata mapping JSON file required the addition of a few fields
in order to fulfill a batch deposit request. It was decided to take
the approach of updating the metadata mapping JSON file as needed
rather than reworking the metadata mapping process at this stage.

How this addresses that need:
* Add 'dc.contributor.department' and 'dc.description.abstract' as fields
* Replace 'nan' with None when reading records from metadata CSV file

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/INFRA-509
@jonavellecuerdo force-pushed the INFRA-509-sccs-deposit-request branch from a94f511 to 7a0a2d8 on March 6, 2025 at 19:54
@jonavellecuerdo marked this pull request as ready for review March 6, 2025 19:55
@jonavellecuerdo requested a review from ghukill March 6, 2025 19:55
@jonavellecuerdo (Contributor, Author) commented:

As discussed with @ghukill during a check-in, there may be solutions better suited to ensuring idempotency in the case of reruns. Closing this PR; I will create a new PR with a subset of the commits above.
