In 1096 ocw workflow #91
"dc.contributor.author": {
    "source_field_name": "instructor",
    "language": "en_US",
    "delimiter": "|"
}
This mapping is made possible by the transformation of the instructors property when the _read_metadata_json_file() method is called.
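As a rough illustration of how a mapping entry like this might be consumed, the sketch below splits an already-joined instructor string on the configured delimiter. `apply_mapping` and the shape of its return value are hypothetical and are not taken from the workflow's code; the source key `instructor` simply follows the mapping snippet above.

```python
# Mapping entry copied from the snippet above.
mapping = {
    "dc.contributor.author": {
        "source_field_name": "instructor",
        "language": "en_US",
        "delimiter": "|",
    }
}


def apply_mapping(source_metadata: dict, mapping: dict) -> dict:
    """Hypothetical helper: build per-field value lists from a mapping."""
    result = {}
    for target_field, rule in mapping.items():
        raw_value = source_metadata.get(rule["source_field_name"], "")
        delimiter = rule.get("delimiter")
        # Split multi-valued fields only when a delimiter is configured.
        values = raw_value.split(delimiter) if delimiter else [raw_value]
        result[target_field] = [
            {"value": value.strip(), "language": rule.get("language")}
            for value in values
            if value.strip()
        ]
    return result


# Assumed input: instructor names already joined with "|" upstream.
source_metadata = {"instructor": "Edelman, Alan|Johnson, Steven G."}
mapped = apply_mapping(source_metadata, mapping)
author_values = [entry["value"] for entry in mapped["dc.contributor.author"]]
```

Joining the names upstream keeps the mapping file declarative: the only thing the mapping needs to know about instructors is which delimiter separates them.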
def _construct_instructor_name(instructor: dict[str, str]) -> str:
    """Given a dictionary of name fields, derive instructor name."""
    if not (last_name := instructor.get("last_name")) or not (
        first_name := instructor.get("first_name")
    ):
        return ""
    return f"{last_name}, {first_name} {instructor.get("middle_initial", "")}".strip()
While it is plausible that all the metadata in data.json will always be formatted as needed (i.e., all instructor name fields provided), it would be a good idea to check in with stakeholders (IN-1156) on the "minimum required instructor name fields" to construct an instructor name.
The sample mapping file we received, ocw_json_to_dspace_mapping.xlsx, indicates that instructor names must be formatted as:
<last_name>, <first_name> <middle_initial>
The code above will return an empty string if either the last_name or first_name is missing; it allows for missing middle_initial values.
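To make the behavior described above concrete, here is a small standalone sketch that mirrors the logic of the method under review. The function is renamed and restructured slightly for illustration, and the final step of joining names with "|" while dropping empty results is an assumption, not something confirmed by the diff.

```python
def construct_instructor_name(instructor: dict[str, str]) -> str:
    """Mirror of the reviewed logic: '<last_name>, <first_name> <middle_initial>'."""
    last_name = instructor.get("last_name")
    first_name = instructor.get("first_name")
    if not last_name or not first_name:
        # Either required field is missing: no name can be constructed.
        return ""
    middle_initial = instructor.get("middle_initial", "")
    # strip() removes the trailing space left by an absent middle initial.
    return f"{last_name}, {first_name} {middle_initial}".strip()


instructors = [
    {"first_name": "Alan", "last_name": "Edelman"},
    {"first_name": "Steven", "last_name": "Johnson", "middle_initial": "G."},
    {"first_name": "Cheese", "middle_initial": "E."},  # last_name missing
]
names = [construct_instructor_name(instructor) for instructor in instructors]

# Assumption: empty names are skipped before joining with the "|" delimiter.
joined = "|".join(name for name in names if name)
```

The third entry shows the case raised for stakeholders: a record without a last_name silently yields an empty string rather than a partial name.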
Met with @ghukill last Friday and wanted to share some thoughts from our discussion:
Agree that 1 & 2 should be handled as separate tickets!
ehanson8
left a comment
Looking great but a few requested changes!
ghukill
left a comment
Overall, looking good to me. Thanks for the discussion the other day, which was quite helpful.
I left a couple of comments/suggestions for fairly minor, syntactical things. None are required or blocking.
I did have another comment that this PR surfaced for me. I'll start by saying that perhaps this could be another ticket for exploring.
Should the CLI command reconcile potentially return a non-zero exit code? Thinking forward to automation, throwing an exit code like 1 or 2 would indicate that reconciliation had failed in some way. This could be helpful for humans, but more helpful for automation like StepFunctions.
I think this came to mind in this PR given the conversations about reconciling and what it's conceptually communicating.
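One possible shape for this idea is sketched below, using only plain functions. `reconcile` and `main` here are hypothetical stand-ins, not the project's actual CLI, and the specific codes (0 for success, 1 for failure) are an assumption.

```python
def reconcile(bitstream_ids: set[str], metadata_ids: set[str]) -> bool:
    """Hypothetical check: every bitstream has metadata and vice versa."""
    return bitstream_ids == metadata_ids


def main(bitstream_ids: set[str], metadata_ids: set[str]) -> int:
    """Return a process exit code instead of only logging the outcome."""
    if reconcile(bitstream_ids, metadata_ids):
        return 0  # success: nothing left unreconciled
    missing_metadata = sorted(bitstream_ids - metadata_ids)
    missing_bitstreams = sorted(metadata_ids - bitstream_ids)
    print(f"Bitstreams without metadata: {missing_metadata}")
    print(f"Metadata without bitstreams: {missing_bitstreams}")
    return 1  # failure: automation can branch on this


ok_code = main({"abc123", "def456"}, {"abc123", "def456"})
failed_code = main({"abc123", "ghi789"}, {"abc123"})
# A real entry point would pass the result to sys.exit(), e.g.
# sys.exit(main(...)), so the shell or an orchestrator sees the code.
```

Returning the code from `main` rather than calling `sys.exit` deep inside the logic keeps the function easy to unit test.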
* Update docstring for OpenCourseWare workflow class
* Use 'removeprefix' over 'replace'
* Include assertion to check for logged 'FileNotFoundError'
ghukill
left a comment
Think it's looking great!
My only request is a standalone method for the item identifier parsing, to ensure it's always done the same way for any part of the workflow that attempts to establish it for an item.
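A minimal sketch of what such a standalone helper could look like follows. The signature, the batch prefix argument, and the assumption that the identifier is the zip file name without its extension are all hypothetical; only the method name `parse_item_identifier` and the preference for `removeprefix` over `replace` come from this PR's commit messages.

```python
def parse_item_identifier(file_uri: str, batch_uri: str) -> str:
    """Hypothetical: derive an item identifier from a zip file URI.

    Strips the leading batch URI and the trailing '.zip' extension.
    """
    return file_uri.removeprefix(batch_uri).removesuffix(".zip")


# Assumed URI layout, based on the batch paths used in the PR description.
batch_uri = "s3://dsc/opencourseware/batch-00/"
identifier = parse_item_identifier(f"{batch_uri}abc123.zip", batch_uri)

# Why removeprefix is safer than replace: replace removes every occurrence
# of the substring, while removeprefix only touches the start of the string.
key = "ocw/mit-ocw/18.S096.zip"
with_replace = key.replace("ocw/", "")
with_removeprefix = key.removeprefix("ocw/")
```

Routing every caller through one function means a future change to the identifier rule is made in a single place.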
ehanson8
left a comment
Looks good to me, I concur with Graham's method and ticket suggestions and then full approval!
* Add 'parse_item_identifier' method
* Move creation of 'zip_file_uri' to '_extract_metadata_from_zip_file' method
Had to reapply the changes to rename the test module for
ghukill
left a comment
Looks good from my POV, even with a nit-pick comment about using a variable rather than re-calling a method.
Slowing down on this PR and working on the reconcile methods, and backlogging some tickets, really feels like the right approach. Hat tip to all involved here for moving the project along in a comprehensive way.
ehanson8
left a comment
Great work! Agree on removing the dupe parse_item_identifier but fully approved!
Why these changes are being introduced:
* Support OpenCourseWare deposits requested by Technical Services staff.

How this addresses that need:
* Define custom methods to extract metadata from 'data.json'
* Define custom 'get_bitstream_s3_uris' to filter to zip files
* Define custom methods to reconcile bitstreams with item metadata (i.e., identify zip files without 'data.json' files)
* Create OpenCourseWare metadata mapping JSON file
* Add unit tests

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1096
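The "filter to zip files" step mentioned in the commit message above can be pictured roughly as follows. `filter_zip_uris` is a hypothetical free function, not the actual 'get_bitstream_s3_uris' method, and the example URIs are made up for illustration.

```python
def filter_zip_uris(uris: list[str]) -> list[str]:
    """Hypothetical: keep only URIs that point at zip files."""
    return [uri for uri in uris if uri.endswith(".zip")]


# Made-up bucket listing for illustration only.
all_uris = [
    "s3://dsc/opencourseware/batch-00/abc123.zip",
    "s3://dsc/opencourseware/batch-00/readme.txt",
    "s3://dsc/opencourseware/batch-00/def456.zip",
]
zip_uris = filter_zip_uris(all_uris)
```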
* Remove root folder from test zip files to emulate OCW zip file structure
* Remove 'item_identifier' arg from OpenCourseWare._extract_metadata_from_zip_file
* Remove 'metadata_json' arg from OpenCourseWare._read_metadata_json_file
* Remove stub method 'process_deposit_results'
ehanson8
left a comment
Fantastic work!
Purpose and background context
This PR creates the OpenCourseWare DSC workflow.
This also introduces the following changes:
SimpleCSV workflow. The result is that when we run make test, workflow test results appear in the terminal together.
How can a reviewer manually see the effects of these changes?
A. Review the added unit tests.
Note: The only custom method defined for OpenCourseWare without a unit test is the item_metadata_iter method. See method B for testing with MinIO server.
OpenCourseWare methods.
B. Optional but highly recommended (especially for future development).
Run OpenCourseWare commands using local MinIO server.
Prerequisite
Follow instructions in README: Running a Local MinIO Server.
Note: As of this writing, the root password set for the local MinIO server must be at least 8 characters long. Didn't want to write this requirement in the README as it is subject to change if/when we download updated versions of the MinIO Docker image.
Mock out the local MinIO server with test zip files.
Note: I did these steps via the WebUI.
dsc bucket: dsc/opencourseware/batch-00/
It is not important to mock other files as the bitstream for OpenCourseWare deposits is the zip file itself.
data.json.
data.json
dsc/opencourseware/batch-01/
Add the following environment variables in your .env file.
OpenCourseWare commands
Launch Python in your terminal:
pipenv run python
item_metadata_iter() result for batch-00.
You should see the following output:
[
    {
        "item_identifier": "abc123",
        "course_title": "Matrix Calculus for Machine Learning and Beyond",
        "course_description": "We all know that calculus courses.",
        "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
        "instructors": "Edelman, Alan|Johnson, Steven G.",
    },
    {
        "item_identifier": "def456",
        "course_title": "Burgers and Beyond",
        "course_description": "Investigating the paranormal, one burger at a time.",
        "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
        "instructors": "Burger, Cheese E.",
    },
]
item_metadata_iter() result for batch-01.
You should see the following output:
A FileNotFoundError is raised if any zip file is missing metadata (i.e., the data.json file).
CLI command: reconcile
reconcile for batch-00 in your terminal:
You should see the following output [REDACTED]:
reconcile for batch-01 in your terminal:
You should see the following output [REDACTED]:
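For readers who want to see the missing-metadata behavior without a MinIO server, the following in-memory sketch may help. It is not the workflow's code: `read_metadata_json` and `make_zip` are hypothetical helpers, and the error message text is invented. It assumes data.json sits at the root of the zip file, in line with the test zip file structure described in this PR.

```python
import io
import json
import zipfile


def make_zip(files: dict[str, str]) -> bytes:
    """Build a zip archive in memory from a name -> content mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zip_file:
        for name, content in files.items():
            zip_file.writestr(name, content)
    return buffer.getvalue()


def read_metadata_json(zip_bytes: bytes, zip_name: str) -> dict:
    """Hypothetical: read data.json from the zip root or raise FileNotFoundError."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zip_file:
        if "data.json" not in zip_file.namelist():
            raise FileNotFoundError(f"'data.json' not found in {zip_name}")
        with zip_file.open("data.json") as metadata_file:
            return json.load(metadata_file)


# One archive with metadata, one without (similar in spirit to batch-01).
with_metadata = make_zip({"data.json": json.dumps({"course_title": "Burgers and Beyond"})})
without_metadata = make_zip({"notes.txt": "no metadata here"})

metadata = read_metadata_json(with_metadata, "def456.zip")

try:
    read_metadata_json(without_metadata, "ghi789.zip")
    error_message = None
except FileNotFoundError as exception:
    error_message = str(exception)
```

Checking `namelist()` before opening the member gives a clear, workflow-specific error instead of the `KeyError` that `ZipFile.open` raises for a missing member.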
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)