@jonavellecuerdo jonavellecuerdo commented Feb 24, 2025

Purpose and background context

This PR adds flexibility to the SimpleCSV workflow. The changes were motivated by issues uncovered during test runs of two SCCS deposits (see IN-1186 for more details).

There are two commits in this PR. Most of the details are in the commit messages, so please refer to them.

The main change to SimpleCSV is in the reconcile methods. As discussed in the ticket, retrieving item identifiers from bitstream filenames is only possible if file naming conventions are strictly followed or regular expressions are used to parse the identifiers out of the filenames. The regex approach does not seem ideal: if file naming conventions vary across deposits, we would potentially have to write a new regex for each variation. The best information that reconcile can provide is:

  • which item identifiers from the metadata CSV file matched bitstreams (i.e., <item_identifier> in <filename>)
  • which bitstream filenames did not match any item identifier

The second commit implements this change.
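
The matching rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in this PR: the function name `reconcile`, its arguments, and its return shape are hypothetical. It only shows the idea of testing whether an item identifier appears in a filename, and of reporting filenames (rather than parsed identifiers) for bitstreams that match nothing.

```python
def reconcile(item_identifiers, bitstream_filenames):
    """Match metadata item identifiers to bitstream filenames.

    Hypothetical helper: a filename matches an identifier when the
    identifier appears anywhere in the filename (plain substring test,
    no regex parsing of the filename).
    """
    matched = {}
    unmatched_filenames = set(bitstream_filenames)
    for identifier in item_identifiers:
        hits = [name for name in bitstream_filenames if identifier in name]
        if hits:
            matched[identifier] = hits
            unmatched_filenames.difference_update(hits)
    # Bitstreams without metadata are reported by filename.
    return matched, sorted(unmatched_filenames)


matched, unmatched = reconcile(
    ["001", "002"],
    ["001_report.pdf", "002-a.pdf", "002-b.pdf", "cover.jpg"],
)
print(matched)    # {'001': ['001_report.pdf'], '002': ['002-a.pdf', '002-b.pdf']}
print(unmatched)  # ['cover.jpg']
```

One caveat of a plain substring test is that a short identifier such as "1" would also match "10.pdf", so this sketch assumes identifiers are distinctive enough not to be contained in one another.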

How can a reviewer manually see the effects of these changes?

  1. In your terminal, navigate to the dspace-submission-composer repo and check out the IN-1186-simplecsv-fix-item-identifiers branch.

  2. Set AWS credentials for Dev.

  3. Temporarily update SCCS workflow class attributes:

    s3_bucket -> "wiley-files-dev-222053980223"
    output_queue -> "dss-wiley-output-dev"
    
  4. Run reconcile command:

    pipenv run dsc --workflow-name sccs --batch-id sccs-batch-2025-02-09-mit-annual-reports reconcile

    You should get the following output:

2025-02-25 08:35:20,141 INFO dsc.cli.main(): Logger 'root' configured with level=INFO
2025-02-25 08:35:20,141 INFO dsc.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-02-25 08:35:20,141 INFO dsc.cli.main(): Running process
2025-02-25 08:35:20,142 INFO dsc.workflows.base.simple_csv.reconcile_bitstreams_and_metadata(): Reconciling bitstreams and metadata for batch 'sccs-batch-2025-02-09-mit-annual-reports'
2025-02-25 08:35:22,310 INFO dsc.workflows.base.simple_csv.reconcile_bitstreams_and_metadata(): Successfully reconciled bitstreams and metadata for all 106 item(s)
2025-02-25 08:35:22,312 INFO dsc.cli.post_main_group_subcommand(): Application exiting
2025-02-25 08:35:22,312 INFO dsc.cli.post_main_group_subcommand(): Total time elapsed: 0:00:02.171109

Includes new or updated dependencies?

YES - Installs pandas

Changes expectations for external applications?

YES - The column in the metadata CSV file that contains the item identifier
can take on any name as long as it is listed in the 'source_field_name'
for the 'item_identifier' entry in the metadata mapping JSON file.
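
As an illustration of the expectation stated above, the sketch below looks up the identifier column through the mapping and reads it from the CSV. Everything here is an assumption made for the example: the mapping shape (a single string under 'source_field_name'), the column name `file_id`, and the helper `read_item_identifiers` are hypothetical. It also uses the standard-library `csv` module instead of pandas to stay self-contained. Note that a follow-up commit in this PR removes 'item_identifier' from the mapping files, so this reflects the approach only as described in this section.

```python
import csv
import io
import json

# Hypothetical mapping: the 'item_identifier' entry names the CSV column.
metadata_mapping = json.loads(
    """
    {
      "item_identifier": {"source_field_name": "file_id"},
      "dc.title": {"source_field_name": "title"}
    }
    """
)


def read_item_identifiers(csv_text, mapping):
    """Return item identifiers from the column named in the mapping."""
    column = mapping["item_identifier"]["source_field_name"]
    reader = csv.DictReader(io.StringIO(csv_text))
    if column not in (reader.fieldnames or []):
        raise ValueError(f"Column '{column}' not found in metadata CSV")
    return [row[column] for row in reader]


sample_csv = "file_id,title\n001,Report A\n002,Report B\n"
print(read_item_identifiers(sample_csv, metadata_mapping))  # ['001', '002']
```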

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* Initial DSC test runs revealed that S3Client.files_iter required
some small fixes.

How this addresses that need:
* Skip the object key for the prefix subfolder (e.g., 'workflow/batch-aaa/')
* Include the forward slash in Workflow.batch_path and update paths as needed
* Update unit tests

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1186
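
The two fixes in this commit message can be illustrated without touching S3. The sketch below is an assumption-laden stand-in, not the real S3Client.files_iter: the function `iter_file_keys` is hypothetical and works on a plain list of keys. It shows why the key that equals the prefix itself is skipped, and one likely benefit of keeping the trailing slash in the batch path, namely that a sibling prefix such as 'workflow/batch-aaa-old/' no longer matches.

```python
def iter_file_keys(object_keys, batch_path):
    """Yield keys for files under batch_path.

    Hypothetical stand-in for an S3 listing: object_keys is a plain list
    of key strings and batch_path is expected to end with '/'.
    """
    for key in object_keys:
        if not key.startswith(batch_path):
            continue
        if key == batch_path:
            # The prefix "subfolder" placeholder object, not a file.
            continue
        yield key


keys = [
    "workflow/batch-aaa/",
    "workflow/batch-aaa/001.pdf",
    "workflow/batch-aaa/002.pdf",
    "workflow/batch-aaa-old/003.pdf",
]
print(list(iter_file_keys(keys, "workflow/batch-aaa/")))
# ['workflow/batch-aaa/001.pdf', 'workflow/batch-aaa/002.pdf']
```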
@jonavellecuerdo jonavellecuerdo changed the title from "In 1186 simplecsv fix item identifiers" to "Add flexibility to SimpleCSV workflow" Feb 25, 2025
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review February 25, 2025 13:37

@ghukill ghukill left a comment

Seeing some good changes in here, and not opposed to this approach of moving the item identifier flexibility into the workflow metadata JSON mapping.

Opted for a quick turnaround review with a fairly detailed question in a comment.

jonavellecuerdo added a commit that referenced this pull request Feb 25, 2025
* Undo changes in SimpleCSV._get_item_identifier
* Define @staticmethod SCCS._get_item_identifier
* Create new test module 'test_workflow_sccs.py'
* Remove 'item_identifier' from metadata mapping JSON files

coveralls commented Feb 25, 2025

Pull Request Test Coverage Report for Build 13528617403

Details

  • 37 of 39 (94.87%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.002%) to 94.22%

Changes Missing Coverage    Covered Lines    Changed/Added Lines    %
dsc/utilities/aws/s3.py     3                4                      75.0%
dsc/workflows/sccs.py       6                7                      85.71%

Totals (coverage status)
Change from base Build 13422831257: -0.002%
Covered Lines: 652
Relevant Lines: 692

💛 - Coveralls

@ghukill ghukill left a comment

Looks good! I think it's a nice adjustment given some testing on real world data.

Why these changes are being introduced:
* Rework SimpleCSV methods to better handle modest variations
in provided metadata CSV files (field names) and
bitstreams (filenames).

How this addresses that need:
* Define SCCS._get_item_identifier to retrieve identifier from accepted
range of values
* Remove 'item_identifier' from metadata mapping JSON files
* Rework SimpleCSV reconcile methods to match metadata and bitstreams
based on presence of item identifier in bitstream filename;
remove code to retrieve item identifiers from bitstream filenames;
for bitstreams without metadata, indicate filenames instead
* Use pandas to read metadata CSV file
* Add and update unit tests

Side effects of this change:
* The column in the metadata CSV file that contains the item identifier
can take on any name as long as it is listed in the 'source_field_name'
for the 'item_identifier' entry in the metadata mapping JSON file.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1186
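
A workflow-specific identifier lookup of the kind this commit message describes might look like the sketch below. It is only a guess at the shape: the class name `SCCSSketch`, the tuple `ACCEPTED_IDENTIFIER_FIELDS`, and the field names in it are hypothetical, and the real SCCS._get_item_identifier may accept different values or behave differently on missing data.

```python
# Hypothetical field names; the real accepted values are not shown in this PR text.
ACCEPTED_IDENTIFIER_FIELDS = ("item_identifier", "filename")


class SCCSSketch:
    """Illustrative stand-in for a workflow class, not the real SCCS."""

    @staticmethod
    def _get_item_identifier(item_metadata):
        """Return the first non-empty value among the accepted fields."""
        for field in ACCEPTED_IDENTIFIER_FIELDS:
            value = item_metadata.get(field)
            if value:
                return str(value).strip()
        raise ValueError(
            f"No item identifier found; tried {ACCEPTED_IDENTIFIER_FIELDS}"
        )


print(SCCSSketch._get_item_identifier({"filename": "002.pdf", "title": "B"}))  # 002.pdf
```

Keeping this lookup on the workflow class rather than in the shared SimpleCSV base means each workflow can tolerate its own variations in metadata column names without changing the base class.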
@jonavellecuerdo jonavellecuerdo force-pushed the IN-1186-simplecsv-fix-item-identifiers branch from dc19443 to e0e138e Compare February 25, 2025 18:45
@jonavellecuerdo jonavellecuerdo merged commit 4637460 into main Feb 25, 2025
2 checks passed
@jonavellecuerdo jonavellecuerdo deleted the IN-1186-simplecsv-fix-item-identifiers branch February 25, 2025 18:48