-
Notifications
You must be signed in to change notification settings - Fork 0
Add flexibility to SimpleCSV workflow #152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Why these changes are being introduced: * Initial DSC test runs uncovered that the S3Client.files_iter required some small fixes. How this addresses that need: * Skip the object key for the prefix subfolder (e.g., 'workflow/batch-aaa/') * Include the forward slash in Workflow.batch_path and update paths as needed * Update unit tests Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1186
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing some good changes in here, and not opposed to this approach of moving the item identifier flexibility into the workflow metadata JSON mapping.
Opted for a quick turnaround review with a fairly detailed question in a comment.
* Undo changes in SimpleCSV._get_item_identifier * Define @staticmethod SCCS._get_item_identifier * Create new test module 'test_workflow_sccs.py' * Remove 'item_identifier' from metadata mapping JSON files
Pull Request Test Coverage Report for Build 13528617403Details
💛 - Coveralls |
ghukill
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I think it's a nice adjustment given some testing on real world data.
Why these changes are being introduced: * Rework SimpleCSV methods to better handle modest variations in provided metadata CSV files (field names) and bitstreams (filenames). How this addresses that need: * Define SCCS._get_item_identifier to retrieve identifier from accepted range of values * Remove 'item_identifier' from metadata mapping JSON files * Rework SimpleCSV reconcile methods to match metadata and bitstreams based on presence of item identifier in bitstream filename; remove code to retrieve item identifiers from bitstream filenames; for bitstreams without metadata, indicate filenames instead * Use pandas to read metadata CSV file * Add and update unit tests Side effects of this change: * The column in the metadata CSV file that contains the item identifier can take on any name as long as it is listed in the 'source_field_name' for the 'item_identifier' entry in the metadata mapping JSON file. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/IN-1186
dc19443 to
e0e138e
Compare
Purpose and background context
This PR introduces changes that add flexibility to the SimpleCSV workflow, which were motivated by issues that were uncovered during test runs of two SCCS deposits (see IN-1186 for more details).
There are two commits in this PR:
S3Client.files_iterSimpleCSVI wrote most of the details in the commit messages, so please refer to them for more details!
The main change to
SimpleCSVis in thereconcilemethods. As discussed in the ticket, it became clear that retrieving item identifiers from bitstream filenames isn't possible unless file naming conventions are strictly followed and/or regular expressions are used to parse item identifiers from filenames (doesn't seem ideal if file naming conventions vary across deposits as we'd have to potentially write new regex to handle each new variation). The best information thatreconcilecan provide is:<item_identifier> in <filename>)The changes in the second commit implement this change.
How can a reviewer manually see the effects of these changes?
In your terminal, navigate to the
dspace-submission-composerrepo and checking intoIN-1186-simplecsv-fix-item-identifiers.Set AWS credentials for Dev.
Temporarily update
SCCSworkflow class attributes:Run
reconcilecommand:You should get the following output:
Includes new or updated dependencies?
YES - Installs
pandasChanges expectations for external applications?
YES - The column in the metadata CSV file that contains the item identifier
can take on any name as long as it is listed in the 'source_field_name'
for the 'item_identifier' entry in the metadata mapping JSON file.
What are the relevant tickets?
Developer
Code Reviewer(s)