
Exploratory field method refactor #166

Draft
wants to merge 4 commits into base: field-method-refactor

Conversation

ehanson8 (Contributor)

Purpose and background context

This is intended to start the discussion of how to approach refactoring transmogrifier to use field methods. Please be thorough and ruthless so we have a good template before kicking off the refactor.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* This commit serves as a starting point for discussing the field method refactor of this application

How this addresses that need:
* Add get_contents and get_dates field methods as examples for future refactoring
* Add private methods for the get_dates method as an example of breaking up large code blocks
* Add unit tests as examples of the expected tests for all future field methods
* Add dspace_dim fixtures as examples for future test suite refactoring
* Organize fixtures in conftest.py

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-273
    return next(source_records)


@pytest.fixture
ehanson8 (Contributor Author):

Unlike the _attribute_and_subfield_variations fixtures (which I didn't touch, though I have concerns about how useful they are), the _errors fixtures are intended to hold all edge cases that would trigger logging or alternate behavior. Keeping them all in one fixture that is used by multiple edge case tests (e.g. test_get_dates_invalid_date_range_skipped) should keep the test suite cleaner. Open to other names for this fixture.
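
For reference, this fixture would presumably follow the same pattern as the other DspaceDim fixtures in conftest.py, along the lines of this sketch (assuming conftest.py's existing pytest and DspaceDim imports):

@pytest.fixture
def dspace_dim_record_errors():
    # parses a fixture file containing only fields that trigger error handling
    source_records = DspaceDim.parse_source_file(
        "tests/fixtures/dspace/dspace_dim_record_errors.xml"
    )
    return next(source_records)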

@pytest.fixture
def dspace_dim_record_optional_fields_blank():
    source_records = DspaceDim.parse_source_file(
        "tests/fixtures/dspace/dspace_dim_record_optional_fields_blank.xml"
ehanson8 (Contributor Author):

We should likely rename this fixture and _optional_fields_missing during the refactor

Comment on lines 262 to 263
def test_get_dates_invalid_date_range_skipped(dspace_dim_record_errors):
    assert DspaceDim.get_dates(dspace_dim_record_errors, "abc123") == []
ehanson8 (Contributor Author) commented Apr 25, 2024:

In addition to success, blank, and missing, tests should be included for any edge cases handled differently by the code, like this one. The number of these tests will vary depending on both the source and the field. This could be an area where we get bogged down, so discussing some general principles would be useful to keep things on track.

@@ -259,6 +221,73 @@ def get_optional_fields(self, xml: Tag) -> dict | None:

        return fields

    @classmethod
    def get_contents(cls, xml: Tag) -> list[str]:
        return [
ehanson8 (Contributor Author):

I propose not bothering with docstrings for the field methods unless there is something unique to describe. A """Get values from source record for TIMDEX XXXXXXXX field.""" docstring for each field does not seem to add any value (even though I did that on aardvark 🙃).

Contributor:

I agree! :)

Contributor:

Also agreed!

Comment on lines 252 to 270
    @classmethod
    def _get_coverage_dates(cls, xml: Tag, source_record_id: str) -> list[timdex.Date]:
        coverage_dates = []
        for coverage_value in [
            str(coverage_element.string)
            for coverage_element in xml.find_all(
                "dim:field", element="coverage", qualifier="temporal", string=True
            )
        ]:
            if "/" in coverage_value:
                date_object = cls._parse_date_range(coverage_value, source_record_id)
            else:
                date_object = timdex.Date(note=coverage_value, kind="coverage")
            if date_object:
                coverage_dates.append(date_object)
        return coverage_dates

    @classmethod
    def _parse_date_range(
ehanson8 (Contributor Author):

In the first pass, I propose limiting the refactoring to creating each field method from the appropriate get_optional_fields code block, with the exception of overly large code blocks such as this one. I don't know how many there will be, but I expect dates will be a common culprit.

    def get_contents(cls, xml: Tag) -> list[str]:
        return [
            str(contents.string)
            for contents in xml.find_all(
ehanson8 (Contributor Author):

Per https://mitlibraries.atlassian.net/browse/TIMX-276, we should soon evaluate whether we want to use lxml or BeautifulSoup to parse XML



def test_get_dates_success(dspace_dim_record_all_fields):
    assert DspaceDim.get_dates(dspace_dim_record_all_fields, "abc123") == [
ehanson8 (Contributor Author):

For field method _success tests, we should extract the values from the _all_fields_transforms_correctly test

jonavellecuerdo (Contributor) left a comment:

@ehanson8 I think this PR does a great job at explaining the main goal of the refactor by providing a few informative examples without making sweeping changes all at once! I also appreciated the comments you added to guide our reviews.

I proposed a few updates! Looking forward to our team discussion on this first pass! ✨

            )
            if t.string
        ] or None
        fields["contents"] = self.get_contents(xml) or None
Contributor:

Just to make sure I understand, the inspiration behind these changes is the pattern we've established in our newer transformer MITAardvark, right? 🤔

ehanson8 (Contributor Author):

Correct! It's not perfect, but it's the closest we've come to the expected end state after this refactor.


@@ -259,6 +221,73 @@ def get_optional_fields(self, xml: Tag) -> dict | None:

        return fields

    @classmethod
    def get_contents(cls, xml: Tag) -> list[str]:
Contributor:

Can we rename xml -> source_record?

For instance, in the case of MITAardvark, we have:

  • Transformer.source_records
  • JSONTransformer.transform(cls, source_record: dict)
  • MITAardvark.<field_method>(cls, source_record: dict)

In that same vein, I think the following naming convention should apply:

  • Transformer.source_records
  • XMLTransformer.transform(cls, source_record: Tag)
  • DSpaceDim.<field_method>(cls, source_record: Tag)

If the purpose of this PR is to set an example for all other field methods, I'd strongly prefer making the change here! :)
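
For illustration, a minimal sketch of the proposed rename as it would appear inside the DspaceDim class (the get_contents body is inferred from the diff context and fixtures in this PR, so it may not match the actual implementation exactly; get_dates is elided):

    @classmethod
    def get_contents(cls, source_record: Tag) -> list[str]:
        return [
            str(contents.string)
            for contents in source_record.find_all(
                "dim:field", element="description", qualifier="tableofcontents", string=True
            )
        ]

    @classmethod
    def get_dates(cls, source_record: Tag, source_record_id: str) -> list[timdex.Date]:
        ...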

ehanson8 (Contributor Author):

Good catch, I'll update that and yes, that's what this PR is intended for!

Contributor:

I think I like source_record as well. That certainly handles either a JSONLines parsed dict or an XML parsed BS4 Tag or lxml Element (XML root).

        split = coverage_value.index("/")
        gte_date = coverage_value[:split]
        lte_date = coverage_value[split + 1 :]
        if validate_date_range(
Contributor:

Hmm, this might be a separate ticket, but I think it's worth discussing where validation should occur. 🤔 It seems like we define validators as part of the TimdexRecord model (e.g. from attrs.validators import instance_of, optional). This means that all custom validators (except for date validators) are run during initialization of TimdexRecord instances.
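
For context, a minimal sketch of the attrs validator pattern being referenced (ExampleRecord and its fields are hypothetical and only illustrate the timing point; this is not the actual TimdexRecord model):

from attrs import define, field
from attrs.validators import instance_of, optional


@define
class ExampleRecord:
    # these validators run when ExampleRecord(...) is instantiated,
    # which is when TimdexRecord's custom validators also run
    title: str = field(validator=instance_of(str))
    note: str | None = field(default=None, validator=optional(instance_of(str)))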

ehanson8 (Contributor Author):

A good point and worthy of discussion, but this does feel like a slightly different type of validation than what we do when init'ing a TimdexRecord, which is more about checking types than the values' content. I'll add it to the agenda!

ghukill (Contributor) left a comment:

Thanks for this first pass at some "field methods" to provide a backdrop for discussions.

My comments contain the bulk of my thoughts so far, but to summarize here and add a couple others:

  • thinking we might want to consider small, stubbed source_record instances that are defined in each test vs. complicated fixtures
  • think the decision of BS4 vs lxml will have some bearing, but not a lot, on what field methods look like (e.g. if we're using XPath, we may want some helper methods)
  • very curious to talk about orchestration of these field methods, as I'm confident that will have some impact on their ergonomics
  • if a field method requires sub-methods as helpers, think about naming them in relation to the part of the data structure they are accessing if appropriate... or a variation of the TIMDEX field they are further breaking down if more applicable. I think this will be hard to get "right", so it will probably just emerge over time.

Overall though, I think the get_contents() and get_dates() are looking like what I had hoped and expected! Concise, clear methods that pull a value for a particular field.

Looking ahead to more discussion, I think if we did want helper methods on XML or JSON objects for things like XPath, it's possible the source_record we pass to the field methods might have more capabilities, and therefore those signatures could also change. Specifically, I've always thought it's a little awkward to pass the source_record_id into the methods. If we pass an actual SourceRecord object, per se, we could have access to a source_record.identifier, for example. Just food for thought for the discussion.
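
Purely as a hypothetical sketch of that SourceRecord idea (nothing like this exists in transmogrifier today, and all names here are illustrative):

from dataclasses import dataclass

from bs4 import Tag


@dataclass
class XMLSourceRecord:
    identifier: str
    data: Tag

    def find_all(self, *args, **kwargs):
        # delegate to the parsed XML so existing field method calls keep working
        return self.data.find_all(*args, **kwargs)

A field method could then call source_record.find_all(...) and log with source_record.identifier instead of accepting a separate source_record_id parameter.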

Comment on lines 239 to 248
        for date_element in xml.find_all("dim:field", element="date", string=True):
            date_value = str(date_element.string.strip())
            if validate_date(date_value, source_record_id):
                if date_element.get("qualifier") == "issued":
                    date_object = timdex.Date(value=date_value, kind="Publication date")
                else:
                    date_object = timdex.Date(
                        value=date_value, kind=date_element.get("qualifier") or None
                    )
                dates.append(date_object)
Contributor:

I might propose -- particularly for something like dates -- that even this could be broken out into a submethod like _get_issued_and_other_dates().
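
For concreteness, a sketch of what that helper could look like, lifted directly from the loop shown above (the name and exact signature are open questions):

    @classmethod
    def _get_issued_and_other_dates(
        cls, xml: Tag, source_record_id: str
    ) -> list[timdex.Date]:
        dates = []
        for date_element in xml.find_all("dim:field", element="date", string=True):
            date_value = str(date_element.string.strip())
            if validate_date(date_value, source_record_id):
                if date_element.get("qualifier") == "issued":
                    date_object = timdex.Date(value=date_value, kind="Publication date")
                else:
                    date_object = timdex.Date(
                        value=date_value, kind=date_element.get("qualifier") or None
                    )
                dates.append(date_object)
        return dates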

Contributor:

I'm noticing that both of these look for <date> and <coverage> elements; they are XML element focused. If an attribute is present, it may give a specific kind value, but otherwise we just save the date, for both methods.

It might be worth considering whether the sub-methods should be oriented around the XML structure they are working with.

For example:

def get_dates():
    dates = []
    dates.extend(self._get_dates_from_date_elements())
    dates.extend(self._get_dates_from_coverage_elements())
    return dates

I'm a little unsure when it's better to pivot from the TIMDEX field we're working on to the XML element we're using, when it comes to method and code organization. But it may be worth considering. If the "field methods" are TIMDEX-y (named to describe what TIMDEX fields they will set), but the sub-methods are source record-y (named to describe what part of the source record structure the data comes from), maybe that could be a pattern?

ehanson8 (Contributor Author):

For example:

def get_dates():
    dates = []
    dates.extend(self._get_dates_from_date_elements())
    dates.extend(self._get_dates_from_coverage_elements())
    return dates

I like the logic of that pattern; let's discuss at the meeting.


Comment on lines 1 to 12
<records>
<record>
<metadata>
<dim:dim xmlns:dim="http://www.dspace.org/xmlns/dspace/dim"
xmlns:doc="http://www.lyncode.com/xoai"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.dspace.org/xmlns/dspace/dim http://www.dspace.org/schema/dim.xsd">
<dim:field element="coverage" qualifier="temporal">2020-01-02/2019-01-01</dim:field>
</dim:dim>
</metadata>
</record>
</records>
Contributor:

I'd like to touch on test fixtures in our discussion. I definitely see the thinking here of having a record with only data that will exercise error handling in the field methods, but I begin to question what this record is other than a vehicle for those XML elements; it does not strike me as a valid, realistic record anymore.

It could be confusing in the future when we look at a fixture like dspace_dim_record_errors.xml expecting to see a realistic DSpace record, just with problematic fields, but instead find just XML elements that will trigger specific field method error handling, without other fields like title, description, etc., that would be common to see.

Perhaps an alternate approach could be to craft problematic XML documents, perhaps with only a single field, in the test itself? Saving fixtures for "realistic"-ish records?

ehanson8 (Contributor Author):

Good point. I'm slightly worried about messy tests with the XML stubs, but let's discuss.

Comment on lines 221 to 222
def test_get_contents_success(dspace_dim_record_all_fields):
    assert DspaceDim.get_contents(dspace_dim_record_all_fields) == ["Chapter 1"]
Contributor:

In the spirit of the comment above about fixtures to exercise field methods, wondering how we might be able to send smaller, more targeted values/documents into the field method for testing?

Something like:

from bs4 import BeautifulSoup

# define a utility method for each source that provides the template
# and accepts an XML block of elements to insert
def create_dspace_source_record_stub(xml_insert: str):
    xml_str = f"""
        <records>
            <record>
                <metadata>
                    <dim:dim xmlns:dim="http://www.dspace.org/xmlns/dspace/dim"
                        xmlns:doc="http://www.lyncode.com/xoai"
                        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                        xsi:schemaLocation="http://www.dspace.org/xmlns/dspace/dim http://www.dspace.org/schema/dim.xsd">
                        {xml_insert}
                    </dim:dim>
                </metadata>
            </record>
        </records>    
        """
    return BeautifulSoup(xml_str, "xml")

# to test how contents are parsed, we pass only 
# the string of the XML that we expect our field method to test
def test_get_contents_success_v2():
    source_record = create_dspace_source_record_stub(
        """
        <dim:field mdschema="dc" element="description" qualifier="tableofcontents" lang="en">Chapter 1</dim:field>
        """
    )
    assert DspaceDim.get_contents(source_record) == ["Chapter 1"]

I think the overhead of setting up the source record "stub" function for each source could be made up by the tests themselves being pretty self-contained.

We'd want to think about when or if it did make sense to have fully realized records as fixtures. Perhaps if we are interested in the orchestration of field methods, or how and when documents are parsed by the XMLTransformer or JSONTransformer. But this, or something similar, feels worth considering as a means to avoid the headache of naming and managing fixtures.
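
As a usage example under that approach, the invalid date range case from the _errors fixture above could become a self-contained test (hypothetical, reusing the stub function sketched earlier):

def test_get_dates_invalid_date_range_skipped_v2():
    source_record = create_dspace_source_record_stub(
        """
        <dim:field element="coverage" qualifier="temporal">2020-01-02/2019-01-01</dim:field>
        """
    )
    assert DspaceDim.get_dates(source_record, "abc123") == []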

Contributor:

If we opted to go down this path, we'd probably want to think a bit about the utility function to create the stubbed record: whether it should be in conftest or the testing file itself, etc. There are probably also some ramifications if we use BS4 vs lxml, but realistically likely not that many.

ehanson8 (Contributor Author):

This looks like a good model and the logic is much cleaner than my proposed _errors fixture

Contributor:

I'd be interested to really kick this around in discussion though. Thankfully, I don't think we have many (any?) transforms that look at element <foo> and element <bar> to determine a value for the final TIMDEX record, so I think these XML blocks we insert will be an element or two.

If we were in the business of taking an element like <foo> and then comparing it against other aspects of the record that are kind of tangential to that element, but that help inform how to interpret it, then this pattern could get cumbersome. But maybe that'd be a good instance for a pytest fixture.

ehanson8 (Contributor Author):

True, I believe marc has some examples involving 2 fields but not many


ehanson8 (Contributor Author):

Specifically, I've always thought it's a little awkward to pass the source_record_id into the methods. If we pass an actual SourceRecord object, per se, we could have access to a source_record.identifier, for example. Just food for thought for the discussion.

Agreed, I hated adding that param in this commit because source_record.identifier should be a thing!

* Rename field method param xml -> source_record
* Refactor get_dates method for clarity
* Remove _errors fixture
* Add create_dspace_dim_source_record_stub function and use stub records in dspace_dim unit tests
assert DspaceDim.get_dates(dspace_dim_record_all_fields, "abc123") == [
def test_get_dates_success():
    source_record = create_dspace_dim_source_record_stub(
        '<dim:field mdschema="dc" element="coverage" qualifier="temporal">'
ehanson8 (Contributor Author):

Aesthetically, I don't love this but I do appreciate having everything right there instead of having to look in an XML file for the relevant values. Happy to see a better way of addressing the long line issue if you have one!

Contributor:

Agreed, this is probably the trickiest part to get ergonomically right.

Here's an option:

# at top of file, blanket skip long lines in a test file, which I'd be okay with
# ruff: noqa: E501

# then in test, can prevent auto-formatting for a block of code with #fmt: off / #fmt: on
source_record = create_dspace_dim_source_record_stub(
    # fmt: off
    """
    <dim:field mdschema="dc" element="coverage" qualifier="temporal">1201-01-01 - 1965-12-21</dim:field>
    <dim:field mdschema="dc" element="coverage" qualifier="temporal">1201-01-01/1965-12-21</dim:field>
    <dim:field mdschema="dc" element="date" qualifier="accessioned">2009-01-08T16:24:37Z</dim:field>
    <dim:field mdschema="dc" element="date" qualifier="available">2009-01-08T16:24:37Z</dim:field>
    <dim:field mdschema="dc" element="date" qualifier="issued">2002-11</dim:field>
    <dim:field mdschema="dc" element="identifier" qualifier="uri">https://hdl.handle.net/1912/2641</dim:field>
    """
    # fmt: on
)
