Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove bbox requirement from MITAardvark and add WKT validator #138

Conversation

jonavellecuerdo
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo commented Feb 7, 2024

Purpose and background context

This PR removes bounding box information as a requirement for MITAardvark records and adds a validator for well-known text (WKT) values for the dcat_bbox (and consequently, locn_geometry) geoshape fields. For review purposes, it might be good to view the work as two parts.

Part 1: Removing bounding box information as a requirement for MITAardvark records
The bulk of this work is done in the first commit. Some minor updates to support this were also made in the third commit.

Part 2: Adding WKT validator
The ValidateGeoshapeWKT class is an example of how we can implement universal validation methods (i.e., usable for any field methods independent of SourceRecord type) moving forward. The proposed framework is to create decorator classes representing validation methods and decorating SourceRecord field methods as-needed.

  1. We were interested if we could add any validation earlier on in the pipeline to address the most common errors we get from indexing TIMDEX records into OpenSearch.

    • (a): "at least three non-collinear points required"
    • (b): "invalid [latitude/longitude] ; must be between [-90.0 and 90.0/-180.0 and 180.0]"
  2. ValidateGeoshapeWKT: After initial testing and discussion (with @ghukill ), it was decided that we would limit the validation to "Is this value a valid WKT string?" There was some code earlier on that took another step forward to check if the "geometry" object (as shapely refers to it) is valid--using the shapely.is_valid and shapely.is_valid_reason methods.

    Of the two most common errors described above, it would've only addressed (a). That said, we felt we would need to have a deeper understanding of shapely before using it. That is when we decided to take a step back and keep the validation simple.

How can a reviewer manually see the effects of these changes?

  1. Review tests in geo-harvester/tests/test_records/test_validator.py.
  2. Run make test and confirm all tests are passing

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/GDT-159

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo force-pushed the GDT-159-remove-bbox-requirement-from-mitaardvark-and-add-wkt-validator branch from 062d61e to 4339baa Compare February 7, 2024 16:01
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review February 7, 2024 16:05
Copy link
Collaborator

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think the attention to code organization, modularity (which results in easy testing), and the impact this will have on shortening the feedback loop for invalid records is superb!

I left some comments about how this might need to be tweaked if we were to create new and other validators (e.g. sharing of some boilerplate logic), but don't think that's necessary now. Better to take this one for a spin, see how it works and feels, before over-engineering the structure for more validators.

But one comment I was interested in digging into more, around non-string values that get passed to the validator; the one with examples of None or [] values. While the end result is good (as noted), I wonder if a bit more edge case handling is needed.

Overall, great work though.

tests/test_records/test_validator.py Outdated Show resolved Hide resolved
harvester/records/validators.py Outdated Show resolved Hide resolved
Comment on lines 31 to 63
def __init__(self, field_method: Callable):
update_wrapper(self, field_method)
self.field_method = field_method

def __call__(self, obj: object) -> str | None:
field_name = self.field_method.__name__.removeprefix("_")
value = self.field_method(obj)

try:
self.create_geoshape(value)
except Exception: # noqa: BLE001
logger.warning(
FieldValueInvalidWarning(
self.invalid_wkt_warning_message.format(
field_name=field_name, value=value
)
)
)
return None
return value

def __get__(self, obj: object, _: type) -> partial[str | None]:
"""Required by decorator to access SourceRecord instance"""
return partial(self.__call__, obj)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct to assume that if we wanted to have other decorator based validators, that they would share this common logic (wrapping the method, getting the field name, etc.)? I don't think it's necessary now -- better to not over engineer -- but if we do add other validators, we'd probably want a base FieldMethodValidator class of sorts that would define all this.

Then, instead of trying self.create_geoshape(value) it could be more generic like self.validate(), which could contain any logic or calls to other methods as needed by that validator.

But again, don't think required now, just food for thought for if/when we extend this pattern.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related, if we do extend it, we'd probably want to have some parameters for what to do when failing validation. In this case, we know it's okay to set to None as it's not a required field and we want the record to continue getting processed, just with a None value. But for others, we may decide that failing the field method validator should bubble up an exception. Or who knows, maybe set a different default value.

Again: all looks good for this PR, but noting in the event we extend this pattern.


try:
self.create_geoshape(value)
except Exception: # noqa: BLE001
Copy link
Collaborator

@ghukill ghukill Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I like that any failure of the validators logic results in a string warning like:

field: dcat_bbox, shapely was unable to parse WKT: 'ENVELOPE()'; setting value to None

I think the generic Exception handling here kind of hints at some underlying complexity and decisions to be made for this pattern.

To demonstrate a bit, here are a couple of tests I created locally:

def test_validator_none_value_fails_validation_returns_none(
    caplog,
):
    caplog.set_level("DEBUG")

    class TestValidatedXMLSourceRecord:
        @ValidateGeoshapeWKT
        def _locn_geometry(self):
            return None

    value = TestValidatedXMLSourceRecord()._locn_geometry()  # noqa: SLF001
    assert "field: locn_geometry, shapely was unable to parse WKT: 'None'" in caplog.text
    assert value == None


def test_validator_list_value_fails_validation_returns_none(
    caplog,
):
    caplog.set_level("DEBUG")

    class TestValidatedXMLSourceRecord:
        @ValidateGeoshapeWKT
        def _locn_geometry(self):
            return []

    value = TestValidatedXMLSourceRecord()._locn_geometry()  # noqa: SLF001
    assert "field: locn_geometry, shapely was unable to parse WKT: '[]'" in caplog.text
    assert value == None

The first test confirms that a true None value returned by the method fails validation because it's cast to a string, and of course "None" isn't a valid WKT string, but somewhat slyly the returned value is also None because that's the default.

In the second test, an empty python list fails validation -- more expected here -- though again cast to a string before validation testing.

In both cases, the final result is good: we have a None value in the record and passed value was logged. What I think could be improved would be if a None value is passed, it should skip even trying to validate and just return None; this might be true for records where there isn't a value and this method would naturally return None.

For the list... I wonder if the logic in the decorator should only attempt validations for strings? if not a string, fail validation immediately as well?

This kind of only matters to the degree that we want the validation warning to tell us precisely what went wrong. In both cases, the "unable to parse" logged warning was accurate, but only barely. And, if we did want future validators to actually re-throw an exception, it might be handy to have more detail in why things went wrong.

Happy to discuss more, but I think what this drives at is the interplay between the __call__ method, the generic handling of any exception from the actual work of the create_geoshape() method, and that method also casting anything to a string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Graham's point about more granularity re: None or str

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghukill and @ehanson8 Do you feel the latest commit resolves this issue? 🤔 Changes made were:

  • Move try...except logic into a ValidateGeoshapeWKT.validate() method.
  • Only log warnings for non-string values (i.e., skip validation with shapely).

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions but overall, this is a very good update

tests/test_records/test_validator.py Outdated Show resolved Hide resolved
tests/test_records/test_validator.py Outdated Show resolved Hide resolved
Why these changes are being introduced:
* This enables Geo-Harvester to process geospatial metadata with missing
or potentially invalid geospatial bounding box information. This change is
in preparation of the ingestion of geospatial metadata from external sources
that will likely vary in data quality. Having these fields as required may be
too restrictive.

How this addresses that need:
* Remove bounding box fields as 'required' from MITAardvark JSON schema
* Remove bounding box fields from test fixtures representing 'required' data
* Move bounding box attributes to 'optional' group in harvester.records.MITAardvark

Side effects of this change:
* Required follow-up work: Geo-Harvester will validate bounding box information when present.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-159
Why these changes are being introduced:
* This is part of an effort to add some data validation to the TIMDEX
pipeline, with an aim to catch errors with records in earlier stages
(e.g., during the harvest). This also proposes a workflow wherein
validation methods can easily be applied as decorators to field methods
for a given SourceRecord.

How this addresses that need:
* Add 'shapely' as a dependency
* Create custom warning for invalid field values
* Create decorator method for validating geoshape WKT values
   * Returns None if geoshape WKT value is invalid (cannot be parsed)

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-159
@jonavellecuerdo jonavellecuerdo force-pushed the GDT-159-remove-bbox-requirement-from-mitaardvark-and-add-wkt-validator branch from 4339baa to 7b03b8d Compare February 9, 2024 15:35
* Create a fixture mocking a SourceRecord with validated fields
* Revise ValidateGeoshapeWKT validator to handle different cases:
   * field method returns invalid string value (shapely cannot parse)
   * field method returns Nonetype value
   * field method returns value that is (not None AND not string)
* Update test suite as needed
@jonavellecuerdo
Copy link
Contributor Author

@ehanson8 and @ghukill --Please see the latest commits!

Note(s):

  1. On the subject of creating an abstract FieldMethodValidator class -- I am definitely on board for it but would like it to be a separate effort from this PR. I need to do some reading on how to properly create an abstract decorator class. 🤔
  2. I am interested in knowing if the proposed changes to ValidateGeoshapeWKT resolve some of the concerns with field methods returning non-string values and producing warning messages that differentiate each case.

Copy link
Collaborator

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment/suggestion, but consider it approved from my end!

As noted, I have a feeling there is a refactor in store for GeoHarvester. While it may not touch this directly, it could be an opportunity to also revisit the validations a little bit. I do think getting this merged sooner than later will be more impactful than the suggested tweaks.

harvester/records/validators.py Outdated Show resolved Hide resolved
harvester/records/validators.py Outdated Show resolved Hide resolved
# Format Validators
#####################
@define
class MITAardvarkFormatValidator:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting as @jonavellecuerdo and I have discussed, probably some room for optimizing how and when this reads schemas from disk, and loads the validator into memory, but tested locally and more than sufficiently quick and memory concious as-is for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghukill I created a JIRA ticket GDT-180 GeoHarvester Optimizations and added a subtask for this effort (GDT-181). Will groom soon!

Comment on lines 47 to 89
invalid_wkt_warning_message: str = (
"field: {field_name}, shapely was unable to parse string as WKT: '{value}'; "
"setting value to None"
)
invalid_none_warning_message: str = (
"field: {field_name}, value of type NoneType was provided; "
"returning original value of None"
)
invalid_type_warning_message: str = (
"field: {field_name}, value of type {type} was provided: {value}; "
"setting value to None"
)

def __init__(self, field_method: Callable):
update_wrapper(self, field_method)
self.field_method = field_method
self.field_name = field_method.__name__.removeprefix("_")

def __call__(self, obj: object) -> str | None:
"""Validate string values retrieved by field method."""
value = self.field_method(obj)

if isinstance(value, str):
return self.validate(value)

if value is None:
logger.warning(
FieldValueInvalidWarning(
self.invalid_none_warning_message.format(
field_name=self.field_name, value=value
)
)
)
return None

logger.warning(
FieldValueInvalidWarning(
self.invalid_type_warning_message.format(
field_name=self.field_name, type=type(value), value=value
)
)
)
return None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might suggest a bit of simplifying here, namely just a shared, generic logging warning when the value validated is not a valid WKT string. Something like:

field: {field_name}, unable to parse WKT from value: '{value}'; setting value to None"

This could apply to all scenarios:

  • bad string (shapely fails)
  • None value (potentially pretty common)
  • non-string value passed.

For a couple of reasons:

  • I think keeping the implementation detail of the shapely library out of the logging makes sense at the logging level; if we ever changed the validation logic or mechanism, we can continue to just say "it's not valid WKT"
  • Not sure we need that granularity in the logging output, if they all return None, and this would reduce code complexity here

Not a required change, please pushback if you'd like to keep it! Can also earmark for future optimization work that might occur around validations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I added shapely when---after our sync--you mentioned that someone reviewing the logs might wonder "who/what" is unable to parse shapely. 🤔

That said, we will be the folks reviewing these logs, right? I do think it would be easy for us to remember what package is used by reviewing the code, so simplifying the log message wouldn't hurt.

@ghukill and @ehanson8 --would you both be alright if I updated the code to instead read:

def __call__(self, obj: object) -> str | None:
        """Validate string values retrieved by field method."""
        value = self.field_method(obj)

        if isinstance(value, str):
            return self.validate(value)
        return None

and update the logged warning to simply say:

"field: {field_name}, unable to parse WKT: {<NoneType> | "<invalid-string-value>" |<invalid-nonstring-value>}, returning None"
```
* "returning None" = value is reset to None OR original None value returned

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @jonavellecuerdo, looks good to me. Apologies if I changed my thinking since our chat, or hadn't fully considered all the scenarios. I'm recalling precisely what you mentioned now... that as a log reader I might wonder, "Why isn't this valid? Looks good to me..." if I didn't know we ran it through shapely first.

I'm hopeful this decorator validation pattern is something we can apply to other fields in this project, and even other projects. I think it's a fine line, and probably exceptions at every turn, but I think we should try and keep the implementation details (like, the lib shapely was used) out of the warning messages when something is not valid.

We could optionally write another DEBUG line that gives more detail if we wanted?

But I think this looks good.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more: I guess it comes down to I think the code complexity of all those logging routes don't outweigh the information loss in the logs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghukill Your last comment makes a really good point! No apologies needed--lots of good learning along the way. 🤓

Please see the latest commit!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your willingness to go back and forth a bit @jonavellecuerdo. I think that last comment was the salient one; took awhile to get there.

tests/test_records/test_validator.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

harvester/records/validators.py Show resolved Hide resolved
@jonavellecuerdo jonavellecuerdo merged commit 53e02bf into main Feb 9, 2024
5 checks passed
@jonavellecuerdo jonavellecuerdo deleted the GDT-159-remove-bbox-requirement-from-mitaardvark-and-add-wkt-validator branch February 9, 2024 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants