@jonavellecuerdo
Contributor

@jonavellecuerdo jonavellecuerdo commented Jan 23, 2025

Purpose and background context

This PR creates the OpenCourseWare DSC workflow.

This also introduces the following changes:

How can a reviewer manually see the effects of these changes?

A. Review the added unit tests.
Note: The only custom method defined for OpenCourseWare without a unit test is the item_metadata_iter method. See method B for testing with MinIO server.


B. Optional but highly recommended (especially for future development).
Run OpenCourseWare commands using local MinIO server.

Prerequisite

  1. Follow instructions in README: Running a Local MinIO Server.
    Note: As of this writing, the root password set for the local MinIO server must be at least 8 characters long. I did not add this requirement to the README because it may change if/when we download updated versions of the MinIO Docker image.

  2. Mock out the local MinIO server with test zip files.
    Note: I did these steps via the WebUI.

    • Create paths (i.e., prefix) in the dsc bucket:
      • dsc/opencourseware/batch-00/
        • Upload two (2) sample zip files with metadata
          It is not important to mock other files as the bitstream for OpenCourseWare deposits is the zip file itself.
          • abc123.zip: Zip file containing a single data.json.
          • def456.zip: Zip file containing a single data.json.
      • dsc/opencourseware/batch-01/
        • Upload one (1) sample zip file without metadata.
  3. Add the following environment variables in your .env file.

    AWS_ENDPOINT_URL=http://localhost:9000/
    AWS_ACCESS_KEY_ID=<local-minio-username>
    AWS_SECRET_ACCESS_KEY=<local-minio-password>
    

OpenCourseWare commands
Launch Python in your terminal: pipenv run python

  1. Check item_metadata_iter() result for batch-00.
from dsc.workflows import OpenCourseWare
opencourseware_workflow_instance = OpenCourseWare(collection_handle="blah", batch_id="batch-00", email_recipients="me@gmail.com")
item_metadata_iter = opencourseware_workflow_instance.item_metadata_iter()
list(item_metadata_iter)

You should see the following output:

[
    {
        "item_identifier": "abc123",
        "course_title": "Matrix Calculus for Machine Learning and Beyond",
        "course_description": "We all know that calculus courses.",
        "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
        "instructors": "Edelman, Alan|Johnson, Steven G.",
    },
    {
        "item_identifier": "def456",
        "course_title": "Burgers and Beyond",
        "course_description": "Investigating the paranormal, one burger at a time.",
        "site_uid": "2318fd9f-1b5c-4a48-8a04-9c56d902a1f8",
        "instructors": "Burger, Cheese E.",
    },
]
  2. Check item_metadata_iter() result for batch-01.
from dsc.workflows import OpenCourseWare
opencourseware_workflow_instance = OpenCourseWare(collection_handle="blah", batch_id="batch-01", email_recipients="me@gmail.com")
item_metadata_iter = opencourseware_workflow_instance.item_metadata_iter()
list(item_metadata_iter)

You should see the following output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jcuerdo/Documents/repos/dspace-submission-composer/dsc/workflows/opencourseware.py", line 60, in item_metadata_iter
    **self._extract_metadata_from_zip_file(zip_file, item_identifier),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jcuerdo/Documents/repos/dspace-submission-composer/dsc/workflows/opencourseware.py", line 76, in _extract_metadata_from_zip_file
    raise FileNotFoundError(
FileNotFoundError: The required file 'data.json' file was not found in the zip file: s3://dsc/opencourseware/batch-01/ghi789.zip

A FileNotFoundError is raised if any zip file is missing metadata (i.e., the data.json file).
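
For readers who want to see the shape of this check without opening the module, here is a rough sketch of the idea using only the standard library. It is not the code in this PR: `extract_metadata_from_zip` is a hypothetical name, and it reads the zip from in-memory bytes rather than from S3.

```python
import io
import json
import zipfile


def extract_metadata_from_zip(zip_bytes: bytes, zip_file_uri: str) -> dict:
    """Read data.json from a zip archive, raising if it is absent."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zip_file:
        # The metadata file is expected at the top level of the archive
        if "data.json" not in zip_file.namelist():
            raise FileNotFoundError(
                "The required file 'data.json' was not found "
                f"in the zip file: {zip_file_uri}"
            )
        with zip_file.open("data.json") as metadata_file:
            return json.load(metadata_file)
```

The URI argument is used only to build the error message, so the caller can tell which zip file in the batch is missing its metadata.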

CLI command: reconcile

  1. Run reconcile for batch-00 in your terminal:
pipenv run dsc -w "opencourseware" -c "abc123" -b "batch-00" -e "me@gmail.com" reconcile

You should see the following output [REDACTED]:

Loading .env environment variables...
2025-01-27 10:03:11,074 INFO root.configure_logger(): INFO
2025-01-27 10:03:11,075 INFO dsc.cli.main(): Logger 'root' configured with level=INFO
2025-01-27 10:03:11,075 INFO dsc.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-01-27 10:03:11,075 INFO dsc.cli.main(): Running process
2025-01-27 10:03:11,094 INFO botocore.credentials.load(): Found credentials in environment variables.
2025-01-27 10:03:11,340 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:03:11,426 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:03:11,438 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
. . .
2025-01-27 10:03:11,515 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
2025-01-27 10:03:11,515 INFO dsc.cli.reconcile(): All item identifiers and bitstreams successfully matched
2025-01-27 10:03:11,515 INFO dsc.cli.post_main_group_subcommand(): Application exiting
2025-01-27 10:03:11,515 INFO dsc.cli.post_main_group_subcommand(): Total time elapsed: 0:00:00.440784
  2. Run reconcile for batch-01 in your terminal:
pipenv run dsc -w "opencourseware" -c "abc123" -b "batch-01" -e "me@gmail.com" reconcile

You should see the following output [REDACTED]:

Loading .env environment variables...
2025-01-27 10:06:44,845 INFO root.configure_logger(): INFO
2025-01-27 10:06:44,845 INFO dsc.cli.main(): Logger 'root' configured with level=INFO
2025-01-27 10:06:44,845 INFO dsc.cli.main(): No Sentry DSN found, exceptions will not be sent to Sentry
2025-01-27 10:06:44,845 INFO dsc.cli.main(): Running process
2025-01-27 10:06:44,857 INFO botocore.credentials.load(): Found credentials in environment variables.
2025-01-27 10:06:44,977 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:06:45,015 INFO botocore.configprovider.provide(): Found endpoint for s3 via: environment_global.
2025-01-27 10:06:45,023 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
...
2025-01-27 10:06:45,033 INFO botocore.httpchecksum.handle_checksum_body(): Skipping checksum validation. Response did not contain one of the following algorithms: ['crc32', 'sha1', 'sha256'].
2025-01-27 10:06:45,035 ERROR dsc.workflows.opencourseware._identify_bitstreams_with_metadata(): The required file 'data.json' file was not found in the zip file: s3://dsc/opencourseware/batch-01/ghi789.zip
2025-01-27 10:06:45,036 ERROR dsc.cli.reconcile(): No item identifiers found for these bitstreams: {'ghi789'}
2025-01-27 10:06:45,036 INFO dsc.cli.post_main_group_subcommand(): Application exiting
2025-01-27 10:06:45,036 INFO dsc.cli.post_main_group_subcommand(): Total time elapsed: 0:00:00.191134

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this Jan 23, 2025
@jonavellecuerdo jonavellecuerdo force-pushed the IN-1096-ocw-workflow branch 2 times, most recently from 9f54f9f to 5097b39 Compare January 23, 2025 18:09
@coveralls

coveralls commented Jan 23, 2025

Pull Request Test Coverage Report for Build 13248106939

Details

  • 64 of 71 (90.14%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.8%) to 95.706%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
dsc/workflows/opencourseware.py | 62 | 69 | 89.86%
Totals Coverage Status
Change from base Build 13245892187: -0.8%
Covered Lines: 624
Relevant Lines: 652

💛 - Coveralls

Comment on lines 32 to 26
"dc.contributor.author": {
    "source_field_name": "instructor",
    "language": "en_US",
    "delimiter": "|"
}

Comment on lines +132 to +182
def _construct_instructor_name(instructor: dict[str, str]) -> str:
    """Given a dictionary of name fields, derive instructor name."""
    if not (last_name := instructor.get("last_name")) or not (
        first_name := instructor.get("first_name")
    ):
        return ""
    return f"{last_name}, {first_name} {instructor.get("middle_initial", "")}".strip()
Contributor Author

@jonavellecuerdo jonavellecuerdo Jan 23, 2025

While it is plausible that all the metadata in data.json will always be formatted as needed (i.e., all instructor name fields provided), it would be a good idea to check in with stakeholders (IN-1156) on the "minimum required instructor name fields" to construct an instructor name.

The sample mapping file we received, ocw_json_to_dspace_mapping.xlsx, indicates that instructor names must be formatted as:

<last_name>, <first_name> <middle_initial>

The code above will return an empty string if either the last_name or first_name is missing; it allows for missing middle_initial values.
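
To make the behavior concrete, here is a small standalone sketch that mirrors the logic above and runs it on a few sample inputs. The function is renamed and written without nested quotes so it also runs on Python versions before 3.12. Dropping empty names before joining is an assumption made for this example, not something the PR is confirmed to do.

```python
def construct_instructor_name(instructor: dict[str, str]) -> str:
    """Mirror of the reviewed logic: '<last_name>, <first_name> <middle_initial>'."""
    last_name = instructor.get("last_name")
    first_name = instructor.get("first_name")
    if not last_name or not first_name:
        return ""
    middle_initial = instructor.get("middle_initial", "")
    # strip() removes the trailing space left when middle_initial is missing
    return f"{last_name}, {first_name} {middle_initial}".strip()


instructors = [
    {"first_name": "Alan", "last_name": "Edelman"},
    {"first_name": "Steven", "last_name": "Johnson", "middle_initial": "G."},
    {"first_name": "Cheese"},  # missing last_name, so the result is ""
]
names = [construct_instructor_name(instructor) for instructor in instructors]

# Assumption: empty names are dropped before joining with the "|" delimiter
joined = "|".join(name for name in names if name)
```

With these inputs, `names` is `["Edelman, Alan", "Johnson, Steven G.", ""]` and `joined` is `"Edelman, Alan|Johnson, Steven G."`.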

@jonavellecuerdo jonavellecuerdo force-pushed the IN-1096-ocw-workflow branch 4 times, most recently from 7238447 to abcf2ae Compare January 24, 2025 20:50
@jonavellecuerdo
Contributor Author

jonavellecuerdo commented Jan 27, 2025

Met with @ghukill last Friday and wanted to share some thoughts from our discussion:

  1. Clarify language around the reconcile method: The base Workflow.reconcile_bitstreams_and_metadata method uses the terms "item_identifiers_without_bitstreams" and "bitstreams_without_item_identifiers". The use of these terms felt a bit awkward with the OpenCourseWare workflow for the following reason:

    . . . the 'reconcile' method only determines whether there are any bitstreams without metadata (any zip files without a 'data.json'). Metadata without bitstreams are basically impossible because the metadata ('data.json') is inside the bitstream (zip file)

    My interpretation of reconcile is that it is really about determining whether there is metadata provided for the bitstream. In the case of OpenCourseWare, all the bitstreams (the zip files) have "item_identifiers" because these are provided in the bitstream (the zip file) filename. For this reason, I chose different naming conventions in OpenCourseWare.reconcile_bitstreams_and_metadata.

    However, the language in the messages logged by the reconcile CLI command also uses the terms "item_identifiers_without_bitstreams" and "bitstreams_without_item_identifiers", which, again, feels a bit awkward in the case of OpenCourseWare workflow (and potentially other future workflows 🤔).

    @ghukill had mentioned that you and he had discussed moving the logging out of the CLI command and into the workflow classes. As this work concerns the base Workflow class, I propose a separate ticket to clarify the language (variable names and logged messages) around the reconcile method.

  2. Update method for creating DSpace metadata to support lists: The OpenCourseWare workflow includes a step that retrieves a list of formatted instructor names and joins them with a delimiter to create a delimited string. @ghukill proposed that the function could return a list instead if Workflow.create_dspace_metadata was updated to handle lists. As this work concerns the base Workflow class, I propose a separate ticket for it, with the OpenCourseWare workflow updated as part of that new ticket.

@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review January 27, 2025 15:08
@ehanson8
Contributor

Agree that 1 & 2 should be handled as separate tickets!

Contributor

@ehanson8 ehanson8 left a comment

Looking great but a few requested changes!


@ghukill ghukill left a comment

Overall, looking good to me. Thanks for the discussion the other day, which was quite helpful.

I left a couple of comments/suggestions for fairly minor, syntactical things. None are required or blocking.

I did have another comment that this PR surfaced for me. I'll start by saying that perhaps this could be another ticket to explore.

Should the CLI command reconcile potentially return a non-zero exit code? Thinking forward to automation, throwing an exit code like 1 or 2 would indicate that reconciliation had failed in some way. This could be helpful for humans, but more helpful for automation like StepFunctions.

I think this came to mind in this PR given the conversations about reconciling and what it's conceptually communicating.
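
To illustrate the exit code idea, here is a rough sketch of how a reconcile result might be mapped to a status code. The function name and the choice of 1 for any mismatch are assumptions for discussion, not existing dsc behavior.

```python
def reconcile_exit_code(
    bitstreams_without_metadata: set[str],
    metadata_without_bitstreams: set[str],
) -> int:
    """Return 0 when everything matched, 1 when reconciliation found gaps."""
    if bitstreams_without_metadata or metadata_without_bitstreams:
        return 1
    return 0


# Example: one zip file lacked data.json, so the status is non-zero
exit_code = reconcile_exit_code({"ghi789"}, set())
```

The CLI command could then pass this value to `sys.exit()`, which would let an orchestrator such as StepFunctions branch on the result rather than parse log output.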

jonavellecuerdo added a commit that referenced this pull request Jan 28, 2025
* Update docstring for OpenCourseWare workflow class
* Use 'removeprefix' over 'replace'
* Include assertion to check for logged 'FileNotFoundError'
@jonavellecuerdo jonavellecuerdo marked this pull request as draft January 29, 2025 14:06
jonavellecuerdo added a commit that referenced this pull request Feb 3, 2025
* Update docstring for OpenCourseWare workflow class
* Use 'removeprefix' over 'replace'
* Include assertion to check for logged 'FileNotFoundError'
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review February 3, 2025 15:25

@ghukill ghukill left a comment

Think it's looking great!

My only request is a standalone method for the item identifier parsing, to ensure it's always done the same way for any part of the workflow that attempts to establish it for an item.
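
Something along these lines is what I have in mind, as a sketch only. The name `parse_item_identifier` and the assumption that the identifier is the zip file name without its extension are illustrative; the actual method may differ.

```python
def parse_item_identifier(file_uri: str) -> str:
    """Derive the item identifier from a zip file URI or key."""
    # Keep only the file name, e.g. "abc123.zip"
    file_name = file_uri.rsplit("/", maxsplit=1)[-1]
    # removesuffix only strips a trailing ".zip", unlike replace
    return file_name.removesuffix(".zip")
```

For example, `parse_item_identifier("s3://dsc/opencourseware/batch-00/abc123.zip")` returns `"abc123"`, and every part of the workflow that needs the identifier can call the same method.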

Contributor

@ehanson8 ehanson8 left a comment

Looks good to me, I concur with Graham's method and ticket suggestions and then full approval!

jonavellecuerdo added a commit that referenced this pull request Feb 3, 2025
* Add 'parse_item_identifier' method
* Move creation of 'zip_file_uri' to '_extract_metadata_from_zip_file' method
@jonavellecuerdo
Contributor Author

Had to reapply the changes to rename the test module for SimpleCSV workflow via commit 7abbc7c. I dropped the first version of the commit from this branch to avoid rebasing issues with main (since new tests were added to SimpleCSV test module).

@jonavellecuerdo jonavellecuerdo requested review from a team, ehanson8 and ghukill February 3, 2025 22:00

@ghukill ghukill left a comment

Looks good from my POV, even with a nit-pick comment about using a variable rather than re-calling a method.

Slowing down on this PR and working on the reconcile methods, and backlogging some tickets, really feels like the right approach. Hat tip to all involved here for moving the project along in a comprehensive way.

Contributor

@ehanson8 ehanson8 left a comment

Great work! Agree on removing the dupe parse_item_identifier but fully approved!

jonavellecuerdo added a commit that referenced this pull request Feb 10, 2025
* Update docstring for OpenCourseWare workflow class
* Use 'removeprefix' over 'replace'
* Include assertion to check for logged 'FileNotFoundError'
jonavellecuerdo added a commit that referenced this pull request Feb 10, 2025
* Add 'parse_item_identifier' method
* Move creation of 'zip_file_uri' to '_extract_metadata_from_zip_file' method
Why these changes are being introduced:
* Support OpenCourseWare deposits requested by Technical Services staff.

How this addresses that need:
* Define custom methods to extract metadata from 'data.json'
* Define custom 'get_bitstream_s3_uris' to filter to zip files
* Define custom methods to reconcile bitstreams with item metadata
(i.e., identify zip files without 'data.json' files)
* Create OpenCourseWare metadata mapping JSON file
* Add unit tests

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/IN-1096
* Update docstring for OpenCourseWare workflow class
* Use 'removeprefix' over 'replace'
* Include assertion to check for logged 'FileNotFoundError'
* Add 'parse_item_identifier' method
* Move creation of 'zip_file_uri' to '_extract_metadata_from_zip_file' method
* Remove root folder from test zip files to emulate OCW zip file structure
* Remove 'item_identifier' arg from OpenCourseWare._extract_metadata_from_zip_file
* Remove 'metadata_json' arg from OpenCourseWare._read_metadata_json_file
* Remove stub method 'process_deposit_results'
@jonavellecuerdo jonavellecuerdo marked this pull request as draft February 10, 2025 21:17
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review February 11, 2025 16:53
Contributor

@ehanson8 ehanson8 left a comment

Fantastic work!

@jonavellecuerdo jonavellecuerdo merged commit 9b48184 into main Feb 11, 2025
2 checks passed
@jonavellecuerdo jonavellecuerdo deleted the IN-1096-ocw-workflow branch February 11, 2025 19:57
5 participants