
Exploratory field method refactor #166

Draft
wants to merge 4 commits into base: field-method-refactor

Conversation

ehanson8 (Contributor)

Purpose and background context

This is intended to start the discussion of how to approach refactoring transmogrifier to use field methods. Please be thorough and ruthless so we have a good template before kicking off the refactor.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* This commit serves as a starting point for discussing the field method refactor of this application

How this addresses that need:
* Add get_contents and get_dates field methods as examples for future refactoring
* Add private methods for the get_dates method as an example of breaking up large code blocks
* Add unit tests as examples of the expected tests for all future field methods
* Add dspace_dim fixtures as examples for future test suite refactoring
* Organize fixtures in conftest.py

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-273
    return next(source_records)


@pytest.fixture
ehanson8 (Contributor Author):

Unlike the _attribute_and_subfield_variations fixtures (which I didn't touch, though I have concerns about how useful they are), the _errors fixtures are intended to hold all edge cases that would trigger logging or alternate behavior. Keeping them all in one fixture that is used by multiple edge case tests (e.g. test_get_dates_invalid_date_range_skipped) should keep the test suite cleaner. Open to other names for this fixture.
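
For reference, this fixture would presumably follow the same pattern as the other DspaceDim fixtures in conftest.py, along the lines of this sketch (assuming conftest.py's existing pytest and DspaceDim imports):

@pytest.fixture
def dspace_dim_record_errors():
    # parses a fixture file containing only fields that trigger error handling
    source_records = DspaceDim.parse_source_file(
        "tests/fixtures/dspace/dspace_dim_record_errors.xml"
    )
    return next(source_records)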

@pytest.fixture
def dspace_dim_record_optional_fields_blank():
    source_records = DspaceDim.parse_source_file(
        "tests/fixtures/dspace/dspace_dim_record_optional_fields_blank.xml"
ehanson8 (Contributor Author):

We should likely rename this fixture and _optional_fields_missing during the refactor

Comment on lines 262 to 263
def test_get_dates_invalid_date_range_skipped(dspace_dim_record_errors):
    assert DspaceDim.get_dates(dspace_dim_record_errors, "abc123") == []
ehanson8 (Contributor Author) commented Apr 25, 2024:

In addition to success, blank, and missing, tests should be included for any edge cases handled differently by the code, like this one. The number of these tests will vary depending on both the source and the field. This could be an area where we get bogged down, so discussing some general principles would be useful to keep things on track.

@@ -259,6 +221,73 @@ def get_optional_fields(self, xml: Tag) -> dict | None:

        return fields

    @classmethod
    def get_contents(cls, xml: Tag) -> list[str]:
        return [
ehanson8 (Contributor Author):

I propose not bothering with docstrings for the field methods unless there is something unique to describe. A """Get values from source record for TIMDEX XXXXXXXX field.""" docstring for each field does not seem to add any value (even though I did that on aardvark 🙃).

Contributor:

I agree! :)

Contributor:

Also agreed!

Comment on lines 252 to 270
    @classmethod
    def _get_coverage_dates(cls, xml: Tag, source_record_id: str) -> list[timdex.Date]:
        coverage_dates = []
        for coverage_value in [
            str(coverage_element.string)
            for coverage_element in xml.find_all(
                "dim:field", element="coverage", qualifier="temporal", string=True
            )
        ]:
            if "/" in coverage_value:
                date_object = cls._parse_date_range(coverage_value, source_record_id)
            else:
                date_object = timdex.Date(note=coverage_value, kind="coverage")
            if date_object:
                coverage_dates.append(date_object)
        return coverage_dates

    @classmethod
    def _parse_date_range(
ehanson8 (Contributor Author):

In the first pass, I propose limiting the refactoring to creating each field method from the appropriate get_optional_fields code block, with the exception of overly large code blocks such as this one. I don't know how many there will be, but I expect dates will be a common culprit.

    def get_contents(cls, xml: Tag) -> list[str]:
        return [
            str(contents.string)
            for contents in xml.find_all(
ehanson8 (Contributor Author):

Per https://mitlibraries.atlassian.net/browse/TIMX-276, we should soon evaluate whether we want to use lxml or BeautifulSoup to parse XML



def test_get_dates_success(dspace_dim_record_all_fields):
    assert DspaceDim.get_dates(dspace_dim_record_all_fields, "abc123") == [
ehanson8 (Contributor Author):

For field method _success tests, we should extract the values from the _all_fields_transforms_correctly test

jonavellecuerdo (Contributor) left a comment:

@ehanson8 I think this PR does a great job at explaining the main goal of the refactor by providing a few informative examples without making sweeping changes all at once! I also appreciated the comments you added to guide our reviews.

I proposed a few updates! Looking forward to our team discussion on this first pass! ✨

            )
            if t.string
        ] or None
        fields["contents"] = self.get_contents(xml) or None
Contributor:

Just to make sure I understand, the inspiration behind these changes is the pattern we've established in our newer transformer MITAardvark, right? 🤔

ehanson8 (Contributor Author):

Correct! It's not perfect, but it's the closest we've come to the expected end state after this refactor.


@@ -259,6 +221,73 @@ def get_optional_fields(self, xml: Tag) -> dict | None:

        return fields

    @classmethod
    def get_contents(cls, xml: Tag) -> list[str]:
Contributor:

Can we rename xml -> source_record?

For instance, in the case of MITAardvark, we have:

  • Transformer.source_records
  • JSONTransformer.transform(cls, source_record: dict)
  • MITAardvark.<field_method>(cls, source_record: dict)

In that same vein, I think the following naming convention should apply:

  • Transformer.source_records
  • XMLTransformer.transform(cls, source_record: Tag)
  • DSpaceDim.<field_method>(cls, source_record: Tag)

If the purpose of this PR is to set an example for all other field methods, I'd strongly prefer making the change here! :)
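
For illustration, a minimal sketch of the proposed rename as it would appear inside the DspaceDim class (the get_contents body is inferred from the diff context and fixtures in this PR, so it may not match the actual implementation exactly; get_dates is elided):

    @classmethod
    def get_contents(cls, source_record: Tag) -> list[str]:
        return [
            str(contents.string)
            for contents in source_record.find_all(
                "dim:field", element="description", qualifier="tableofcontents", string=True
            )
        ]

    @classmethod
    def get_dates(cls, source_record: Tag, source_record_id: str) -> list[timdex.Date]:
        ...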

ehanson8 (Contributor Author):

Good catch, I'll update that and yes, that's what this PR is intended for!

Contributor:

I think I like source_record as well. That certainly handles either a JSONLines parsed dict or an XML parsed BS4 Tag or lxml Element (XML root).

        split = coverage_value.index("/")
        gte_date = coverage_value[:split]
        lte_date = coverage_value[split + 1 :]
        if validate_date_range(
Contributor:

Hmm, this might be a separate ticket, but I think it's worth discussing where validation should occur. 🤔 It seems like we define validators as part of the TimdexRecord model (e.g. from attrs.validators import instance_of, optional). This means that all custom validators (except for date validators) are run during initialization of TimdexRecord instances.
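
For context, a minimal sketch of the attrs validator pattern being referenced (ExampleRecord and its fields are hypothetical and only illustrate the timing point; this is not the actual TimdexRecord model):

from attrs import define, field
from attrs.validators import instance_of, optional


@define
class ExampleRecord:
    # these validators run when ExampleRecord(...) is instantiated,
    # which is when TimdexRecord's custom validators also run
    title: str = field(validator=instance_of(str))
    note: str | None = field(default=None, validator=optional(instance_of(str)))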

ehanson8 (Contributor Author):

A good point and worthy of discussion, but this does feel like a slightly different type of validation than what we do when init'ing a TimdexRecord, which is more about checking types than the values' content. I'll add it to the agenda!

ghukill (Contributor) left a comment:

Thanks for this first pass at some "field methods" to provide a backdrop for discussions.

My comments contain the bulk of my thoughts so far, but to summarize here and add a couple others:

  • thinking we might want to consider small, stubbed source_record instances that are defined in each test vs. complicated fixtures
  • think the decision of BS4 vs lxml will have some bearing, but not a lot, on what field methods look like (e.g. if we're using XPath, we may want some helper methods)
  • very curious to talk about orchestration of these field methods, as I'm confident that will have some impact on their ergonomics
  • if a field method requires sub-methods as helpers, think about naming them in relation to the part of the data structure they are accessing if appropriate... or a variation of the TIMDEX field they are further breaking down if more applicable. I think this will be hard to get "right", so it will probably just emerge over time.

Overall though, I think the get_contents() and get_dates() are looking like what I had hoped and expected! Concise, clear methods that pull a value for a particular field.

Looking ahead to more discussion, I think if we did want helper methods on XML or JSON objects for things like XPath, it's possible the source_record we pass to the field methods might have more capabilities, and therefore those signatures could also change. Specifically, I've always thought it's a little awkward to pass the source_record_id into the methods. If we pass an actual SourceRecord object, per se, we could have access to a source_record.identifier, for example. Just food for thought for the discussion.
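
Purely as a hypothetical sketch of that SourceRecord idea (nothing like this exists in transmogrifier today, and all names here are illustrative):

from dataclasses import dataclass

from bs4 import Tag


@dataclass
class XMLSourceRecord:
    identifier: str
    data: Tag

    def find_all(self, *args, **kwargs):
        # delegate to the parsed XML so existing field method calls keep working
        return self.data.find_all(*args, **kwargs)

A field method could then call source_record.find_all(...) and log with source_record.identifier instead of accepting a separate source_record_id parameter.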

Comment on lines 239 to 248
        for date_element in xml.find_all("dim:field", element="date", string=True):
            date_value = str(date_element.string.strip())
            if validate_date(date_value, source_record_id):
                if date_element.get("qualifier") == "issued":
                    date_object = timdex.Date(value=date_value, kind="Publication date")
                else:
                    date_object = timdex.Date(
                        value=date_value, kind=date_element.get("qualifier") or None
                    )
                dates.append(date_object)
Contributor:

I might propose -- particularly for something like dates -- that even this could be broken out into a submethod like _get_issued_and_other_dates().
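
For concreteness, a sketch of what that helper could look like, lifted directly from the loop shown above (the name and exact signature are open questions):

    @classmethod
    def _get_issued_and_other_dates(
        cls, xml: Tag, source_record_id: str
    ) -> list[timdex.Date]:
        dates = []
        for date_element in xml.find_all("dim:field", element="date", string=True):
            date_value = str(date_element.string.strip())
            if validate_date(date_value, source_record_id):
                if date_element.get("qualifier") == "issued":
                    date_object = timdex.Date(value=date_value, kind="Publication date")
                else:
                    date_object = timdex.Date(
                        value=date_value, kind=date_element.get("qualifier") or None
                    )
                dates.append(date_object)
        return dates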

Contributor:

I'm noticing that both of these look for <date> and <coverage> elements; they are XML element focused. If an attribute is present, it may give a specific kind value, but otherwise we just save the date, for both methods.

It might be worth considering whether the sub-methods should be oriented around the XML structure they are working with.

For example:

def get_dates():
    dates = []
    dates.extend(self._get_dates_from_date_elements())
    dates.extend(self._get_dates_from_coverage_elements())
    return dates

I'm a little unsure when it's better to pivot from the TIMDEX field we're working on to the XML element we're using, when it comes to method and code organization. But it may be worth considering. If the "field methods" are TIMDEX-y (named to describe what TIMDEX fields they will set), but the sub-methods are source record-y (named to describe what part of the source record structure the data comes from), maybe that could be a pattern?

ehanson8 (Contributor Author):

For example:

def get_dates():
    dates = []
    dates.extend(self._get_dates_from_date_elements())
    dates.extend(self._get_dates_from_coverage_elements())
    return dates

I like the logic of that pattern; let's discuss at the meeting.


Comment on lines 1 to 12
<records>
<record>
<metadata>
<dim:dim xmlns:dim="http://www.dspace.org/xmlns/dspace/dim"
xmlns:doc="http://www.lyncode.com/xoai"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.dspace.org/xmlns/dspace/dim http://www.dspace.org/schema/dim.xsd">
<dim:field element="coverage" qualifier="temporal">2020-01-02/2019-01-01</dim:field>
</dim:dim>
</metadata>
</record>
</records>
Contributor:

I'd like to touch on test fixtures in our discussion. I definitely see the thinking here of having a record with only data that will exercise error handling in the field methods, but I begin to question what this record is other than a vehicle for those XML elements; it does not strike me as a valid, realistic record anymore.

It could be confusing in the future when we look at a fixture like dspace_dim_record_errors.xml expecting to see a realistic DSpace record, just with problematic fields, but instead find just XML elements that will trigger specific field method error handling, without other fields like title, description, etc., that would be common to see.

Perhaps an alternate approach could be to craft problematic XML documents, perhaps with only a single field, in the test itself? Saving fixtures for "realistic"-ish records?

ehanson8 (Contributor Author):

Good point. I'm slightly worried about messy tests with the XML stubs, but let's discuss.

Comment on lines 221 to 222
def test_get_contents_success(dspace_dim_record_all_fields):
    assert DspaceDim.get_contents(dspace_dim_record_all_fields) == ["Chapter 1"]
Contributor:

In the spirit of the comment above about fixtures to exercise field methods, wondering how we might be able to send smaller, more targeted values/documents into the field method for testing?

Something like:

from bs4 import BeautifulSoup

# define a utility method for each source that provides the template
# and accepts an XML block of elements to insert
def create_dspace_source_record_stub(xml_insert: str):
    xml_str = f"""
        <records>
            <record>
                <metadata>
                    <dim:dim xmlns:dim="http://www.dspace.org/xmlns/dspace/dim"
                        xmlns:doc="http://www.lyncode.com/xoai"
                        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                        xsi:schemaLocation="http://www.dspace.org/xmlns/dspace/dim http://www.dspace.org/schema/dim.xsd">
                        {xml_insert}
                    </dim:dim>
                </metadata>
            </record>
        </records>    
        """
    return BeautifulSoup(xml_str, "xml")

# to test how contents are parsed, we pass only 
# the string of the XML that we expect our field method to test
def test_get_contents_success_v2():
    source_record = create_dspace_source_record_stub(
        """
        <dim:field mdschema="dc" element="description" qualifier="tableofcontents" lang="en">Chapter 1</dim:field>
        """
    )
    assert DspaceDim.get_contents(source_record) == ["Chapter 1"]

I think the overhead of setting up the source record "stub" function for each source could be made up by the tests themselves being pretty self-contained.

We'd want to think about when or if it did make sense to have fully realized records as fixtures. Perhaps if we are interested in the orchestration of field methods, or how and when documents are parsed by the XMLTransformer or JSONTransformer. But this, or something similar, feels worth considering as a means to avoid the headache of naming and managing fixtures.
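
As a usage example under that approach, the invalid date range case from the _errors fixture above could become a self-contained test (hypothetical, reusing the stub function sketched earlier):

def test_get_dates_invalid_date_range_skipped_v2():
    source_record = create_dspace_source_record_stub(
        """
        <dim:field element="coverage" qualifier="temporal">2020-01-02/2019-01-01</dim:field>
        """
    )
    assert DspaceDim.get_dates(source_record, "abc123") == []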

Contributor:

If we opted to go down this path, we'd probably want to think a bit about the utility function to create the stubbed record: whether it should be in conftest or the testing file itself, etc. There are probably also some ramifications if we use BS4 vs lxml, but realistically likely not that many.

ehanson8 (Contributor Author):

This looks like a good model and the logic is much cleaner than my proposed _errors fixture

Contributor:

I'd be interested to really kick this around in discussion though. Thankfully, I don't think we have many (any?) transforms that look at element <foo> and element <bar> to determine a value for the final TIMDEX record, so I think these XML blocks we insert will be an element or two.

If we were in the business of taking an element like <foo> and then comparing it against other aspects of the record that are kind of tangential to that element, but that help inform how to interpret it, then this pattern could get cumbersome. But maybe that'd be a good instance for a pytest fixture.

ehanson8 (Contributor Author):

True, I believe marc has some examples involving 2 fields but not many


ehanson8 (Contributor Author):

Specifically, I've always thought it's a little awkward to pass the source_record_id into the methods. If we pass an actual SourceRecord object, per se, we could have access to a source_record.identifier, for example. Just food for thought for the discussion.

Agreed, I hated adding that param in this commit because source_record.identifier should be a thing!

* Rename field method param xml -> source_record
* Refactor get_dates method for clarity
* Remove _errors fixture
* Add create_dspace_dim_source_record_stub function and use stub records in dspace_dim unit tests
assert DspaceDim.get_dates(dspace_dim_record_all_fields, "abc123") == [
def test_get_dates_success():
    source_record = create_dspace_dim_source_record_stub(
        '<dim:field mdschema="dc" element="coverage" qualifier="temporal">'
ehanson8 (Contributor Author):

Aesthetically, I don't love this but I do appreciate having everything right there instead of having to look in an XML file for the relevant values. Happy to see a better way of addressing the long line issue if you have one!

Contributor:

Agreed, this is probably the trickiest part to get ergonomically right.

Here's an option:

# at top of file, blanket skip long lines in a test file, which I'd be okay with
# ruff: noqa: E501

# then in test, can prevent auto-formatting for a block of code with #fmt: off / #fmt: on
source_record = create_dspace_dim_source_record_stub(
    # fmt: off
    """
    <dim:field mdschema="dc" element="coverage" qualifier="temporal">1201-01-01 - 1965-12-21</dim:field>
    <dim:field mdschema="dc" element="coverage" qualifier="temporal">1201-01-01/1965-12-21</dim:field>
    <dim:field mdschema="dc" element="date" qualifier="accessioned">2009-01-08T16:24:37Z</dim:field>
    <dim:field mdschema="dc" element="date" qualifier="available">2009-01-08T16:24:37Z</dim:field>
    <dim:field mdschema="dc" element="date" qualifier="issued">2002-11</dim:field>
    <dim:field mdschema="dc" element="identifier" qualifier="uri">https://hdl.handle.net/1912/2641</dim:field>
    """
    # fmt: on
)
