Remove bbox requirement from MITAardvark and add WKT validator #138

jonavellecuerdo · 2024-02-07T14:48:34Z

Purpose and background context

This PR removes bounding box information as a requirement for MITAardvark records and adds a validator for well-known text (WKT) values for the dcat_bbox (and consequently, locn_geometry) geoshape fields. For review purposes, it might be good to view the work as two parts.

Part 1: Removing bounding box information as a requirement for MITAardvark records
The bulk of this work is done in the first commit. Some minor updates to support this were also made in the third commit.

Part 2: Adding WKT validator
The ValidateGeoshapeWKT class is an example of how we can implement universal validation methods (i.e., usable for any field methods independent of SourceRecord type) moving forward. The proposed framework is to create decorator classes representing validation methods and decorating SourceRecord field methods as-needed.

We were interested if we could add any validation earlier on in the pipeline to address the most common errors we get from indexing TIMDEX records into OpenSearch.
- (a): "at least three non-collinear points required"
- (b): "invalid [latitude/longitude] ; must be between [-90.0 and 90.0/-180.0 and 180.0]"
ValidateGeoshapeWKT: After initial testing and discussion (with @ghukill ), it was decided that we would limit the validation to "Is this value a valid WKT string?" There was some code earlier on that took another step forward to check if the "geometry" object (as shapely refers to it) is valid--using the shapely.is_valid and shapely.is_valid_reason methods.

Of the two most common errors described above, it would've only addressed (a). That said, we felt we would need to have a deeper understanding of shapely before using it. That is when we decided to take a step back and keep the validation simple.

How can a reviewer manually see the effects of these changes?

Review tests in geo-harvester/tests/test_records/test_validator.py.
Run make test and confirm all tests are passing

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/GDT-159

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed and verified
New dependencies are appropriate or there were no changes

ghukill

Overall, I think the attention to code organization, modularity (which results in easy testing), and the impact this will have on shortening the feedback loop for invalid records is superb!

I left some comments about how this might need to be tweaked if we were to create new and other validators (e.g. sharing of some boilerplate logic), but don't think that's necessary now. Better to take this one for a spin, see how it works and feels, before over-engineering the structure for more validators.

But one comment I was interested in digging into more, around non-string values that get passed to the validator; the one with examples of None or [] values. While the end result is good (as noted), I wonder if a bit more edge case handling is needed.

Overall, great work though.

tests/test_records/test_validator.py

harvester/records/validators.py

ghukill · 2024-02-07T19:29:50Z

harvester/records/validators.py

+    def __init__(self, field_method: Callable):
+        update_wrapper(self, field_method)
+        self.field_method = field_method
+
+    def __call__(self, obj: object) -> str | None:
+        field_name = self.field_method.__name__.removeprefix("_")
+        value = self.field_method(obj)
+
+        try:
+            self.create_geoshape(value)
+        except Exception:  # noqa: BLE001
+            logger.warning(
+                FieldValueInvalidWarning(
+                    self.invalid_wkt_warning_message.format(
+                        field_name=field_name, value=value
+                    )
+                )
+            )
+            return None
+        return value
+
+    def __get__(self, obj: object, _: type) -> partial[str | None]:
+        """Required by decorator to access SourceRecord instance"""
+        return partial(self.__call__, obj)


Is it correct to assume that if we wanted to have other decorator based validators, that they would share this common logic (wrapping the method, getting the field name, etc.)? I don't think it's necessary now -- better to not over engineer -- but if we do add other validators, we'd probably want a base FieldMethodValidator class of sorts that would define all this.

Then, instead of trying self.create_geoshape(value) it could be more generic like self.validate(), which could contain any logic or calls to other methods as needed by that validator.

But again, don't think required now, just food for thought for if/when we extend this pattern.

Related, if we do extend it, we'd probably want to have some parameters for what to do when failing validation. In this case, we know it's okay to set to None as it's not a required field and we want the record to continue getting processed, just with a None value. But for others, we may decide that failing the field method validator should bubble up an exception. Or who knows, maybe set a different default value.

Again: all looks good for this PR, but noting in the event we extend this pattern.

ghukill · 2024-02-07T19:53:59Z

harvester/records/validators.py

+
+        try:
+            self.create_geoshape(value)
+        except Exception:  # noqa: BLE001


While I like that any failure of the validators logic results in a string warning like:

field: dcat_bbox, shapely was unable to parse WKT: 'ENVELOPE()'; setting value to None

I think the generic Exception handling here kind of hints at some underlying complexity and decisions to be made for this pattern.

To demonstrate a bit, here are a couple of tests I created locally:

def test_validator_none_value_fails_validation_returns_none( caplog, ): caplog.set_level("DEBUG") class TestValidatedXMLSourceRecord: @ValidateGeoshapeWKT def _locn_geometry(self): return None value = TestValidatedXMLSourceRecord()._locn_geometry() # noqa: SLF001 assert "field: locn_geometry, shapely was unable to parse WKT: 'None'" in caplog.text assert value == None def test_validator_list_value_fails_validation_returns_none( caplog, ): caplog.set_level("DEBUG") class TestValidatedXMLSourceRecord: @ValidateGeoshapeWKT def _locn_geometry(self): return [] value = TestValidatedXMLSourceRecord()._locn_geometry() # noqa: SLF001 assert "field: locn_geometry, shapely was unable to parse WKT: '[]'" in caplog.text assert value == None

The first test confirms that a true None value returned by the method fails validation because it's cast to a string, and of course "None" isn't a valid WKT string, but somewhat slyly the returned value is also None because that's the default.

In the second test, an empty python list fails validation -- more expected here -- though again cast to a string before validation testing.

In both cases, the final result is good: we have a None value in the record and passed value was logged. What I think could be improved would be if a None value is passed, it should skip even trying to validate and just return None; this might be true for records where there isn't a value and this method would naturally return None.

For the list... I wonder if the logic in the decorator should only attempt validations for strings? if not a string, fail validation immediately as well?

This kind of only matters to the degree that we want the validation warning to tell us precisely what went wrong. In both cases, the "unable to parse" logged warning was accurate, but only barely. And, if we did want future validators to actually re-throw an exception, it might be handy to have more detail in why things went wrong.

Happy to discuss more, but I think what this drives at is the interplay between the __call__ method, the generic handling of any exception from the actual work of the create_geoshape() method, and that method also casting anything to a string.

Agree with Graham's point about more granularity re: None or str

@ghukill and @ehanson8 Do you feel the latest commit resolves this issue? 🤔 Changes made were:

Move try...except logic into a ValidateGeoshapeWKT.validate() method.

Only log warnings for non-string values (i.e., skip validation with shapely).

ehanson8

Some questions but overall, this is a very good update

tests/test_records/test_validator.py

Why these changes are being introduced: * This enables Geo-Harvester to process geospatial metadata with missing or potentially invalid geospatial bounding box information. This change is in preparation of the ingestion of geospatial metadata from external sources that will likely vary in data quality. Having these fields as required may be too restrictive. How this addresses that need: * Remove bounding box fields as 'required' from MITAardvark JSON schema * Remove bounding box fields from test fixtures representing 'required' data * Move bounding box attributes to 'optional' group in harvester.records.MITAardvark Side effects of this change: * Required follow-up work: Geo-Harvester will validate bounding box information when present. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-159

Why these changes are being introduced: * This is part of an effort to add some data validation to the TIMDEX pipeline, with an aim to catch errors with records in earlier stages (e.g., during the harvest). This also proposes a workflow wherein validation methods can easily be applied as decorators to field methods for a given SourceRecord. How this addresses that need: * Add 'shapely' as a dependency * Create custom warning for invalid field values * Create decorator method for validating geoshape WKT values * Returns None if geoshape WKT value is invalid (cannot be parsed) Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-159

* Create a fixture mocking a SourceRecord with validated fields * Revise ValidateGeoshapeWKT validator to handle different cases: * field method returns invalid string value (shapely cannot parse) * field method returns Nonetype value * field method returns value that is (not None AND not string) * Update test suite as needed

jonavellecuerdo · 2024-02-09T17:40:55Z

@ehanson8 and @ghukill --Please see the latest commits!

Note(s):

On the subject of creating an abstract FieldMethodValidator class -- I am definitely on board for it but would like it to be a separate effort from this PR. I need to do some reading on how to properly create an abstract decorator class. 🤔
I am interested in knowing if the proposed changes to ValidateGeoshapeWKT resolve some of the concerns with field methods returning non-string values and producing warning messages that differentiate each case.

ghukill

Left one comment/suggestion, but consider it approved from my end!

As noted, I have a feeling there is a refactor in store for GeoHarvester. While it may not touch this directly, it could be an opportunity to also revisit the validations a little bit. I do think getting this merged sooner than later will be more impactful than the suggested tweaks.

harvester/records/validators.py

ghukill · 2024-02-09T18:59:54Z

harvester/records/validators.py

+# Format Validators
+#####################
+@define
+class MITAardvarkFormatValidator:


Just noting as @jonavellecuerdo and I have discussed, probably some room for optimizing how and when this reads schemas from disk, and loads the validator into memory, but tested locally and more than sufficiently quick and memory concious as-is for now.

@ghukill I created a JIRA ticket GDT-180 GeoHarvester Optimizations and added a subtask for this effort (GDT-181). Will groom soon!

ghukill · 2024-02-09T19:45:18Z

harvester/records/validators.py

+    invalid_wkt_warning_message: str = (
+        "field: {field_name}, shapely was unable to parse string as WKT: '{value}'; "
+        "setting value to None"
+    )
+    invalid_none_warning_message: str = (
+        "field: {field_name}, value of type NoneType was provided; "
+        "returning original value of None"
+    )
+    invalid_type_warning_message: str = (
+        "field: {field_name}, value of type {type} was provided: {value}; "
+        "setting value to None"
+    )
+
+    def __init__(self, field_method: Callable):
+        update_wrapper(self, field_method)
+        self.field_method = field_method
+        self.field_name = field_method.__name__.removeprefix("_")
+
+    def __call__(self, obj: object) -> str | None:
+        """Validate string values retrieved by field method."""
+        value = self.field_method(obj)
+
+        if isinstance(value, str):
+            return self.validate(value)
+
+        if value is None:
+            logger.warning(
+                FieldValueInvalidWarning(
+                    self.invalid_none_warning_message.format(
+                        field_name=self.field_name, value=value
+                    )
+                )
+            )
+            return None
+
+        logger.warning(
+            FieldValueInvalidWarning(
+                self.invalid_type_warning_message.format(
+                    field_name=self.field_name, type=type(value), value=value
+                )
+            )
+        )
+        return None


I might suggest a bit of simplifying here, namely just a shared, generic logging warning when the value validated is not a valid WKT string. Something like:

field: {field_name}, unable to parse WKT from value: '{value}'; setting value to None"

This could apply to all scenarios:

bad string (shapely fails)

None value (potentially pretty common)

non-string value passed.

For a couple of reasons:

I think keeping the implementation detail of the shapely library out of the logging makes sense at the logging level; if we ever changed the validation logic or mechanism, we can continue to just say "it's not valid WKT"

Not sure we need that granularity in the logging output, if they all return None, and this would reduce code complexity here

Not a required change, please pushback if you'd like to keep it! Can also earmark for future optimization work that might occur around validations.

Hmm, I added shapely when---after our sync--you mentioned that someone reviewing the logs might wonder "who/what" is unable to parse shapely. 🤔

That said, we will be the folks reviewing these logs, right? I do think it would be easy for us to remember what package is used by reviewing the code, so simplifying the log message wouldn't hurt.

@ghukill and @ehanson8 --would you both be alright if I updated the code to instead read:

def __call__(self, obj: object) -> str | None: """Validate string values retrieved by field method.""" value = self.field_method(obj) if isinstance(value, str): return self.validate(value) return None

and update the logged warning to simply say:

"field: {field_name}, unable to parse WKT: {<NoneType> | "<invalid-string-value>" |<invalid-nonstring-value>}, returning None" ``` * "returning None" = value is reset to None OR original None value returned

Yes @jonavellecuerdo, looks good to me. Apologies if I changed my thinking since our chat, or hadn't fully considered all the scenarios. I'm recalling precisely what you mentioned now... that as a log reader I might wonder, "Why isn't this valid? Looks good to me..." if I didn't know we ran it through shapely first.

I'm hopeful this decorator validation pattern is something we can apply to other fields in this project, and even other projects. I think it's a fine line, and probably exceptions at every turn, but I think we should try and keep the implementation details (like, the lib shapely was used) out of the warning messages when something is not valid.

We could optionally write another DEBUG line that gives more detail if we wanted?

But I think this looks good.

Thinking more: I guess it comes down to I think the code complexity of all those logging routes don't outweigh the information loss in the logs.

@ghukill Your last comment makes a really good point! No apologies needed--lots of good learning along the way. 🤓

Please see the latest commit!

Thanks for your willingness to go back and forth a bit @jonavellecuerdo. I think that last comment was the salient one; took awhile to get there.

tests/test_records/test_validator.py

ghukill

Looks good to me!

harvester/records/validators.py

jonavellecuerdo force-pushed the GDT-159-remove-bbox-requirement-from-mitaardvark-and-add-wkt-validator branch from 062d61e to 4339baa Compare February 7, 2024 16:01

jonavellecuerdo marked this pull request as ready for review February 7, 2024 16:05

jonavellecuerdo requested review from ghukill and ehanson8 February 7, 2024 16:05

ghukill reviewed Feb 7, 2024

View reviewed changes

ehanson8 reviewed Feb 8, 2024

View reviewed changes

tests/test_records/test_validator.py Outdated Show resolved Hide resolved

tests/test_records/test_validator.py Outdated Show resolved Hide resolved

jonavellecuerdo added 4 commits February 9, 2024 10:34

Update SourceRecord classes (FGDC, ISO19139)

9edf677

Move MITAardvark format validation method into validator module

7b03b8d

jonavellecuerdo force-pushed the GDT-159-remove-bbox-requirement-from-mitaardvark-and-add-wkt-validator branch from 4339baa to 7b03b8d Compare February 9, 2024 15:35

jonavellecuerdo requested review from ghukill and ehanson8 February 9, 2024 17:37

ghukill approved these changes Feb 9, 2024

View reviewed changes

ehanson8 approved these changes Feb 9, 2024

View reviewed changes

tests/test_records/test_validator.py Outdated Show resolved Hide resolved

Simplify logging for invalid WKT warning

fd8478c

jonavellecuerdo requested review from ghukill and ehanson8 February 9, 2024 20:56

ghukill approved these changes Feb 9, 2024

View reviewed changes

harvester/records/validators.py Show resolved Hide resolved

ehanson8 approved these changes Feb 9, 2024

View reviewed changes

jonavellecuerdo merged commit 53e02bf into main Feb 9, 2024
5 checks passed

jonavellecuerdo deleted the GDT-159-remove-bbox-requirement-from-mitaardvark-and-add-wkt-validator branch February 9, 2024 21:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove bbox requirement from MITAardvark and add WKT validator #138

Remove bbox requirement from MITAardvark and add WKT validator #138

jonavellecuerdo commented Feb 7, 2024 •

edited by ehanson8

Loading

ghukill left a comment

ghukill Feb 7, 2024

ghukill Feb 7, 2024

ghukill Feb 7, 2024 •

edited

Loading

ehanson8 Feb 8, 2024

jonavellecuerdo Feb 9, 2024

ehanson8 left a comment

jonavellecuerdo commented Feb 9, 2024

ghukill left a comment

ghukill Feb 9, 2024

jonavellecuerdo Feb 9, 2024

ghukill Feb 9, 2024

jonavellecuerdo Feb 9, 2024

ghukill Feb 9, 2024

ghukill Feb 9, 2024

ehanson8 Feb 9, 2024

jonavellecuerdo Feb 9, 2024

ghukill Feb 9, 2024

ghukill left a comment

Remove bbox requirement from MITAardvark and add WKT validator #138

Remove bbox requirement from MITAardvark and add WKT validator #138

Conversation

jonavellecuerdo commented Feb 7, 2024 • edited by ehanson8 Loading

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Developer

Code Reviewer(s)

ghukill left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghukill Feb 7, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ehanson8 left a comment

Choose a reason for hiding this comment

jonavellecuerdo commented Feb 9, 2024

ghukill left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghukill left a comment

Choose a reason for hiding this comment

jonavellecuerdo commented Feb 7, 2024 •

edited by ehanson8

Loading

ghukill Feb 7, 2024 •

edited

Loading