Timx 283 oaidc field method refactor #177

Merged
merged 1 commit into from
May 28, 2024

Conversation

jonavellecuerdo
Contributor

@jonavellecuerdo jonavellecuerdo commented May 21, 2024

Purpose and background context

Field method refactor for base transform class OaiDc.

Note: Linters are failing for transmogrifier.sources.xml.springshare, which will be addressed in a later PR!

How can a reviewer manually see the effects of these changes?

  1. Run make test and verify all tests are passing.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this May 21, 2024
xml = next(transformer_instance.source_records)
assert transformer_instance.get_dates("test_source_record_id", xml) == [
timdex.Date(kind="Unknown", note=None, range=None, value="2008-06-19T17:55:27")
assert OaiDc("cool-repo", source_records).get_content_type() == ["cool-repo"]
Contributor Author

Because the content_type is derived from the source instance variable, OaiDc must be instantiated for the test.
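To illustrate the point above, here is a minimal sketch of why a method that reads instance state needs an instance while a method that depends only on its arguments does not. The class `OaiDcSketch` and the method `get_publishers` are hypothetical stand-ins, not the real OaiDc implementation.

```python
from __future__ import annotations


class OaiDcSketch:
    """Hypothetical stand-in for a transformer class; not the real OaiDc."""

    def __init__(self, source: str) -> None:
        self.source = source

    def get_content_type(self) -> list[str]:
        # Reads self.source, so the class must be instantiated first.
        return [self.source]

    @classmethod
    def get_publishers(cls, publishers: list[str]) -> list[str]:
        # Depends only on its arguments, so no instance is required.
        return [p.strip() for p in publishers if p.strip()]


content_type = OaiDcSketch("cool-repo").get_content_type()
publishers = OaiDcSketch.get_publishers([" MIT ", ""])
```

In a test, the first call has to construct the object, while the second can be called directly on the class.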

Comment on lines 200 to 207
def test_get_identifiers_transforms_correctly_if_fields_blank(
    oai_dc_record_optional_fields_blank,
):
    oai_dc_record_optional_fields_blank.header.identifier.clear()
    assert OaiDc.get_identifiers(oai_dc_record_optional_fields_blank) == []


def test_get_identifiers_transforms_correctly_if_fields_missing(
    oai_dc_record_optional_fields_missing,
):
    oai_dc_record_optional_fields_missing.header.identifier.decompose()
    assert OaiDc.get_identifiers(oai_dc_record_optional_fields_missing) == []
Contributor Author

Curious what you think of this setup. 🤔 The identifiers are derived from a child of the header element, so using the stub (as it is written) or the oai_dc_record_* fixtures directly was not possible.

Contributor

I think that works

Contributor

Ah yes, good call out. This is probably one of the inherent issues with the stubs.

Though I agree this works, another option could be modifying the function that creates the stub to one that allows both header overrides and actual content?

Maybe that's a strong argument for keeping these stubs in the test suite for that source itself vs the higher level abstraction proposed in this ticket.

Example:

from bs4 import BeautifulSoup


def create_oaidc_source_record_stub(
    header_identifier: str = "oai:libguides.com:guides/123456",  # <--- new
    xml_insert: str = "",
) -> BeautifulSoup:
    # header_identifier is templated into the <identifier> element below.
    xml_str = f"""
        <records>
            <record>
                <header>
                    <identifier>{header_identifier}</identifier>
                    <datestamp>2023-05-31T19:49:21Z</datestamp>
                    <setSpec>guides</setSpec>
                </header>
                <metadata>
                    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                        xmlns:dc="http://purl.org/dc/elements/1.1/"
                        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                        xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
                        {xml_insert}
                    </oai_dc:dc>
                </metadata>
            </record>
        </records>
    """
    return BeautifulSoup(xml_str, "xml")

Just a thought exercise though, as I agree the method used above works well too.

Contributor Author

@jonavellecuerdo jonavellecuerdo May 23, 2024

Update from @ghukill: The whole header element should be editable via a parameter to the stub method.
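A rough sketch of what such a stub could look like, with the entire header content passed in as a parameter. This uses the standard library's `xml.etree.ElementTree` in place of BeautifulSoup so that it runs without extra dependencies; the function name `create_record_stub` and the default values are assumptions for illustration, and the parameter names follow the `header_insert` / `metadata_insert` usage seen later in this review.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def create_record_stub(
    header_insert: str = "<identifier>oai:libguides.com:guides/123456</identifier>",
    metadata_insert: str = "",
) -> ET.Element:
    """Build a small OAI DC-like record; both inserts are raw XML strings."""
    xml_str = f"""
    <records>
      <record>
        <header>{header_insert}</header>
        <metadata>
          <oai_dc:dc xmlns:oai_dc="{OAI_DC_NS}" xmlns:dc="{DC_NS}">
            {metadata_insert}
          </oai_dc:dc>
        </metadata>
      </record>
    </records>
    """
    # strip() removes the leading newline before the root element.
    return ET.fromstring(xml_str.strip())


default_stub = create_record_stub()
blank_stub = create_record_stub(header_insert="<identifier></identifier>")
missing_stub = create_record_stub(header_insert="")
```

With this shape, the "blank" and "missing" identifier cases can be built directly instead of mutating a fixture with clear() or decompose().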

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-283-springshare-field-method-refactor branch from cd5569a to dd69c7d Compare May 21, 2024 18:57
@jonavellecuerdo jonavellecuerdo changed the title Timx 283 springshare field method refactor Timx 283 oaidc field method refactor May 21, 2024
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review May 21, 2024 19:01
Contributor

@ehanson8 ehanson8 left a comment

Looks good, some very minor suggestions

@ehanson8
Contributor

And let's resolve the 'Signature of "get_dates"' linter issue before we merge this PR; happy to assist.

@jonavellecuerdo jonavellecuerdo marked this pull request as draft May 22, 2024 13:31
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-283-springshare-field-method-refactor branch 2 times, most recently from a3ee3cf to 45a75bc Compare May 22, 2024 19:42
-        return str(xml.find("dc:identifier").string)
+        if source_link := source_record.find("dc:identifier", string=True):
+            return str(source_link.string)
+        return ""
Contributor Author

Previously, the method would raise an AttributeError if dc:identifier did not appear in the XML:

AttributeError: 'NoneType' object has no attribute 'string'

Given that source_link is a required field, raising an error seems to be the right approach, but is there a cleaner way to handle this case?

Contributor

I think your instinct to use SkippedRecordEvent here is correct, @ghukill your thoughts?

Contributor

Agreed. If we can't pull a dc:identifier value, then we don't have a source link to send people to, and it feels like it's not worth keeping the record (punting the awareness of this to later observability/metrics discussions).

So yes, raising a SkippedRecordEvent feels like the right move!
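A minimal sketch of the approach agreed on above: raise a skip exception when the required dc:identifier is missing or blank, rather than returning an empty string or failing with an AttributeError. The `SkippedRecordEvent` class here is a local stand-in for the project's exception, the message wording is illustrative, and the parsing uses the standard library's ElementTree rather than BeautifulSoup.

```python
import xml.etree.ElementTree as ET

DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


class SkippedRecordEvent(Exception):
    """Local stand-in for the project's exception of the same name."""


def get_source_link(source_record: ET.Element) -> str:
    """Return the dc:identifier text, or signal that the record is skipped."""
    identifier = source_record.find(".//dc:identifier", DC_NS)
    if identifier is not None and identifier.text and identifier.text.strip():
        return identifier.text.strip()
    # Hypothetical message; the real wording is up to the project.
    message = (
        "Record skipped because 'source_link' could not be derived. "
        "The 'dc:identifier' was either missing or blank."
    )
    raise SkippedRecordEvent(message)
```

The caller (for example, the loop that iterates over source records) would then catch the exception, count the skip, and move on to the next record.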

@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review May 22, 2024 20:04
@jonavellecuerdo
Contributor Author

@ghukill @ehanson8 The PR now includes the required updates to the SpringshareOaiDc transform.

Note(s):

  1. Renaming SpringshareOaiDc -> Springshare was floated during our last meeting. If we choose to do this, I will submit it in a separate PR to ease review of the changes.
  2. While working on this, I wondered: what should field methods return if data for a required field is missing? See comment.
    • @ehanson8 Is SpringshareOaiDc.get_source_link an instance where the new exception SkippedRecordEvent should be used? 🤔

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-283-springshare-field-method-refactor branch from 45a75bc to a05fc00 Compare May 22, 2024 20:14
from bs4 import BeautifulSoup

import transmogrifier.models as timdex
from transmogrifier.sources.xml.oaidc import OaiDc


-def create_oai_dc_source_record_stub(xml_insert: str = "") -> BeautifulSoup:
+def create_oaidc_source_record_stub(xml_insert: str = "") -> BeautifulSoup:
Contributor

@ehanson8 ehanson8 May 23, 2024

I like the default, I'll update my PR with that. Actually, let's just use the default with the new global stub function that will be created

Contributor

+1

Contributor

I had agreed with the comment until the strikethrough, lolz!

See this comment about how maybe keeping the function closer to the tests is handy, given we can modify the signature to set values as needed in the stub. Maybe good fodder for our discussion today.

To me, feels like either approach is still kind of on the table.

)


def test_get_dates_transforms_correctly_and_logs_error_if_date_invalid(
Contributor

Perfect example of an edge case test!

@@ -51,16 +52,16 @@ def get_dates(self, source_record_id: str, xml: Tag) -> list[timdex.Date]:
)
return dates
Contributor

As discussed in the other PR, this return statement can be updated to use 'or None'.
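For reference, a small sketch of the 'or None' pattern: the field method itself converts an empty list to None, so callers no longer append 'or None' at every call site. The function below is a simplified, hypothetical version that works on plain strings rather than timdex.Date objects.

```python
from __future__ import annotations


def get_dates(raw_dates: list[str]) -> list[str] | None:
    """Return the non-blank date strings, or None if there are none."""
    dates = [value.strip() for value in raw_dates if value.strip()]
    # An empty list is falsy, so this yields None when nothing was found.
    return dates or None


some_dates = get_dates([" 2008-06-19T17:55:27 "])
no_dates = get_dates(["", "   "])
```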

Contributor

@ghukill ghukill left a comment

Submitting some comments now for our discussion, with the intent of completing a further review later.


# languages: not set in this transformation

# links
-        fields["links"] = self.get_links(source_record_id, xml) or None
+        fields["links"] = self.get_links(source_record, source_record_id) or None
Contributor

Do we still need to pass source_record_id, if the source_record contains it?

And/or, could the method get_links() just call the other method get_source_record_id() again if not?

Feeling like maybe this is touched on in other comments. Full disclaimer: I wrote these transforms fairly early on, and recall they had requirements that deviated from previous Transmog sources (as I think you've commented on as well @jonavellecuerdo, like the identifier vs source link). So it feels possible that now is a good time to revisit some of those awkward edges. Silver lining: these are not yet live, so I think we're fairly free to modify them somewhat heavily if needed.
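One way the suggestion above could look, sketched under assumptions: the field method receives only the source record and derives the record ID itself by calling the ID method. The class name `OaiDcSketch`, the dictionary shape of a link, and the way the ID is read from the header are all hypothetical; the parsing uses the standard library's ElementTree rather than BeautifulSoup.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"


class OaiDcSketch:
    """Hypothetical stand-in; not the real OaiDc class."""

    @classmethod
    def get_source_record_id(cls, source_record: ET.Element) -> str:
        # Assumption: the ID is simply the header identifier text.
        return source_record.findtext("header/identifier", default="").strip()

    def get_links(self, source_record: ET.Element) -> list[dict] | None:
        # The ID is derived here instead of being passed in as a parameter.
        source_record_id = self.get_source_record_id(source_record)
        links = [
            {"source_record_id": source_record_id, "url": element.text.strip()}
            for element in source_record.iter(DC_IDENTIFIER)
            if element.text and element.text.strip()
        ]
        return links or None
```

The trade-off is a repeated lookup per field method in exchange for a simpler, uniform signature where `source_record` is the only required argument.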

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-283-springshare-field-method-refactor branch from a05fc00 to 8b946ff Compare May 23, 2024 19:06
@jonavellecuerdo
Contributor Author

@ghukill @ehanson8 Please see the latest three (3) commits for updates based on our sync yesterday! See Summary of changes via Field Method Refactor for more details.

Contributor

@ehanson8 ehanson8 left a comment

Looking great, just a few more suggestions!

"""
<dc:creator>Ye Li</dc:creator>
"""
metadata_insert=(
Contributor

Good call on named args!

):
oai_dc_record_all_fields.header.identifier.clear()
assert OaiDc.get_source_record_id(oai_dc_record_all_fields) == ""
def test_get_source_record_id_transforms_properly_if_fields_blank():
Contributor

Rename to indicate it's raising the exception

Contributor Author

Renamed to test_get_source_record_id_raises_skipped_record_event_if_fields_blank().

):
oai_dc_record_all_fields.header.identifier.decompose()
assert OaiDc.get_source_record_id(oai_dc_record_all_fields) == ""
def test_get_source_record_id_transforms_properly_if_fields_missing():
Contributor

Same here

Contributor Author

Renamed to test_get_source_record_id_raises_skipped_record_event_if_fields_missing.

"The 'identifier' was either missing from the header element or blank."
),
):
SpringshareOaiDc.get_source_link("", "", source_record=source_record)


def test_get_source_link_transforms_correctly_if_required_fields_missing():
Contributor

Another rename to describe exception

Contributor Author

Renamed to test_get_source_link_raises_skipped_record_event_if_required_fields_blank and test_get_source_link_raises_skipped_record_event_if_required_fields_missing.

"""
),
)
assert OaiDc.get_dates(source_record=source_record) == [
Contributor

However, I think named args for a single param method might be overkill 🙂

Contributor

My rule of thumb -- like anything, open to exceptions -- is use positional when the function signature is positional, and use named when the function signature is named. Keeps it simple.
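As a quick illustration of that rule of thumb, the sketch below shows one function with a positional signature and one whose parameters are keyword-only (enforced by the bare `*`). Both functions are hypothetical and exist only to show the calling convention.

```python
def get_dates(source_record):
    """Single positional parameter: call it positionally."""
    return [source_record]


def create_stub(*, header_insert="", metadata_insert=""):
    """Keyword-only parameters (note the bare *): call them by name."""
    return f"<header>{header_insert}</header><metadata>{metadata_insert}</metadata>"


# Positional signature, positional call.
dates = get_dates("record")

# Named signature, named call.
stub = create_stub(metadata_insert="<dc:creator>Ye Li</dc:creator>")
```

Making the stub's parameters keyword-only is one way to have the interpreter enforce the convention: calling `create_stub("x")` raises a TypeError.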

Contributor Author

@ghukill I'll follow this convention moving forward. I updated the function calls in tests to use positional arguments to match function signatures.

Contributor

@ghukill ghukill left a comment

Overall, thinking it's going well! Appreciating that OaiDc and Springshare are kind of finicky, given there are heavy inheritance implications.

Left a couple of comments / questions.

Comment on lines +203 to +206
message = (
    "Record skipped because 'source_record_id' could not be derived. "
    "The 'identifier' was either missing from the header element or blank."
)
raise SkippedRecordEvent(message)
Contributor

Whether or not we end up layering in more granular exceptions, I think this is a great exception message. Ignoring when or how this is logged, this does explain why this record would be skipped.

One thing we might want to consider is how to control these messages so we could group them, even across sources. I think that's where more granular Exception types could come in. Kind of a balance/art between the Exception type and the message. It will be easy to group, log, and store metrics by exception type, while the string itself is likely most helpful for deep debugging of a single record.
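A sketch of how more granular exception types could support that kind of grouping. The subclass names and the `summarize_skips` helper are hypothetical, and `SkippedRecordEvent` is defined locally as a stand-in for the project's exception.

```python
from __future__ import annotations

from collections import Counter


class SkippedRecordEvent(Exception):
    """Local stand-in for the project's base skip exception."""


class MissingSourceRecordIdSkip(SkippedRecordEvent):
    """Hypothetical subtype: the header identifier was missing or blank."""


class MissingSourceLinkSkip(SkippedRecordEvent):
    """Hypothetical subtype: no usable dc:identifier for a source link."""


def summarize_skips(events: list[SkippedRecordEvent]) -> Counter:
    """Group skip events by exception type, ignoring the message text."""
    return Counter(type(event).__name__ for event in events)


events = [
    MissingSourceRecordIdSkip("record 1: 'identifier' was blank"),
    MissingSourceRecordIdSkip("record 7: 'identifier' was missing"),
    MissingSourceLinkSkip("record 9: 'dc:identifier' was missing"),
]
counts = summarize_skips(events)
```

The type drives the metrics, while each message stays free-form for debugging an individual record. Existing handlers that catch `SkippedRecordEvent` keep working because the subtypes inherit from it.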

@@ -69,15 +73,20 @@ def get_links(self, source_record_id: str, xml: Tag) -> list[timdex.Link] | None
url=str(identifier.string),
)
)

# [TODO]: Message is logged even if the condition is met.
Contributor

@jonavellecuerdo - was this intended to be left in?

Contributor Author

Ah, yes. I think this is one of the logging messages that this issue refers to. I'll add a reference to this line of code as part of the issue and remove the comment to avoid linting errors.


-    def get_links(self, source_record_id: str, xml: Tag) -> list[timdex.Link] | None:
+    def get_links(self, source_record: Tag) -> list[timdex.Link] | None:
Contributor

Any reason why this is not a @classmethod?

I collapsed everything for a bird's eye view and this popped out:

[Screenshot 2024-05-24 at 3:13:27 PM]

Contributor Author

Contributor

@jonavellecuerdo - thanks for the explanation! It's altogether possible that I introduced that when I first worked on these. I feel like there is a bit of code smell there, and I'm probably to blame, but feels like maybe it's worth revisiting during the 2nd orchestration phase of this work.

Comment on lines +173 to +177
header_insert=(
"""
<identifier>oai:libguides.com:guides/175846</identifier>
"""
),
Contributor

It doesn't seem harmful here, but I'm wondering if the header_insert is needed? Wouldn't that only be needed for tests that are looking to pull the identifier from it?

Contributor Author

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-283-springshare-field-method-refactor branch from 32ce150 to dde85aa Compare May 28, 2024 15:48
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-283-springshare-field-method-refactor branch from dde85aa to 445448f Compare May 28, 2024 15:53
Contributor

@ghukill ghukill left a comment

Looks good to me, thanks for the changes!

Contributor

@ehanson8 ehanson8 left a comment

Approved, but remove references to source_record_id from the docstrings

source_record_id: Source record id
xml: A BeautifulSoup Tag representing a single OAI DC XML record.
source_record: A BeautifulSoup Tag representing a single OAI DC record in XML.
source_record_id: Source record ID.
Contributor

No longer needed in the docstring since the param was removed

source_record_id: Source record id
xml: A BeautifulSoup Tag representing a single OAI DC XML record.
source_record: A BeautifulSoup Tag representing a single OAI DC record in XML.
source_record_id: Source record ID.
Contributor

No longer needed in the docstring since the param was removed

source_record_id: Source record id
xml: A BeautifulSoup Tag representing a single OAI DC XML record.
source_record: A BeautifulSoup Tag representing a single OAI DC record in XML.
source_record_id: Source record ID.
Contributor

No longer needed in the docstring since the param was removed

@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-283-springshare-field-method-refactor branch from 445448f to 3534a01 Compare May 28, 2024 20:48
Why these changes are being introduced:
* These updates are required to implement the architecture described
in the following ADR: https://github.com/MITLibraries/transmogrifier/blob/main/docs/adrs/0005-field-methods.md

How this addresses that need:
* Refactor base transform class OaiDc to contain get_*
methods for optional fields
* Make 'source_record' the first argument in OaiDc field methods
* Move 'or None' to the return statements of field methods
* Remove 'source_record_id' as param, replace with call inside field method instead
* Raise SkippedRecordEvent
  * OaiDc.get_source_record_id
  * SpringshareOaiDc.get_source_link
* Update unit tests
  * Remove OaiDc and SpringshareOaiDc record_* fixtures

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-283
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-283-springshare-field-method-refactor branch from 3534a01 to 537f6ec Compare May 28, 2024 20:52
@jonavellecuerdo jonavellecuerdo merged commit e351409 into main May 28, 2024
3 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-283-springshare-field-method-refactor branch May 28, 2024 20:53
3 participants