-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove bbox requirement from MITAardvark and add WKT validator #138
Remove bbox requirement from MITAardvark and add WKT validator #138
Conversation
062d61e
to
4339baa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think the attention to code organization, modularity (which results in easy testing), and the impact this will have on shortening the feedback loop for invalid records is superb!
I left some comments about how this might need to be tweaked if we were to create new and other validators (e.g. sharing of some boilerplate logic), but don't think that's necessary now. Better to take this one for a spin, see how it works and feels, before over-engineering the structure for more validators.
But one comment I was interested in digging into more, around non-string values that get passed to the validator; the one with examples of None
or []
values. While the end result is good (as noted), I wonder if a bit more edge case handling is needed.
Overall, great work though.
harvester/records/validators.py
Outdated
def __init__(self, field_method: Callable): | ||
update_wrapper(self, field_method) | ||
self.field_method = field_method | ||
|
||
def __call__(self, obj: object) -> str | None: | ||
field_name = self.field_method.__name__.removeprefix("_") | ||
value = self.field_method(obj) | ||
|
||
try: | ||
self.create_geoshape(value) | ||
except Exception: # noqa: BLE001 | ||
logger.warning( | ||
FieldValueInvalidWarning( | ||
self.invalid_wkt_warning_message.format( | ||
field_name=field_name, value=value | ||
) | ||
) | ||
) | ||
return None | ||
return value | ||
|
||
def __get__(self, obj: object, _: type) -> partial[str | None]: | ||
"""Required by decorator to access SourceRecord instance""" | ||
return partial(self.__call__, obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct to assume that if we wanted to have other decorator based validators, that they would share this common logic (wrapping the method, getting the field name, etc.)? I don't think it's necessary now -- better to not over engineer -- but if we do add other validators, we'd probably want a base FieldMethodValidator
class of sorts that would define all this.
Then, instead of trying self.create_geoshape(value)
it could be more generic like self.validate()
, which could contain any logic or calls to other methods as needed by that validator.
But again, don't think required now, just food for thought for if/when we extend this pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related, if we do extend it, we'd probably want to have some parameters for what to do when failing validation. In this case, we know it's okay to set to None
as it's not a required field and we want the record to continue getting processed, just with a None
value. But for others, we may decide that failing the field method validator should bubble up an exception. Or who knows, maybe set a different default value.
Again: all looks good for this PR, but noting in the event we extend this pattern.
|
||
try: | ||
self.create_geoshape(value) | ||
except Exception: # noqa: BLE001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I like that any failure of the validators logic results in a string warning like:
field: dcat_bbox, shapely was unable to parse WKT: 'ENVELOPE()'; setting value to None
I think the generic Exception
handling here kind of hints at some underlying complexity and decisions to be made for this pattern.
To demonstrate a bit, here are a couple of tests I created locally:
def test_validator_none_value_fails_validation_returns_none(
caplog,
):
caplog.set_level("DEBUG")
class TestValidatedXMLSourceRecord:
@ValidateGeoshapeWKT
def _locn_geometry(self):
return None
value = TestValidatedXMLSourceRecord()._locn_geometry() # noqa: SLF001
assert "field: locn_geometry, shapely was unable to parse WKT: 'None'" in caplog.text
assert value == None
def test_validator_list_value_fails_validation_returns_none(
caplog,
):
caplog.set_level("DEBUG")
class TestValidatedXMLSourceRecord:
@ValidateGeoshapeWKT
def _locn_geometry(self):
return []
value = TestValidatedXMLSourceRecord()._locn_geometry() # noqa: SLF001
assert "field: locn_geometry, shapely was unable to parse WKT: '[]'" in caplog.text
assert value == None
The first test confirms that a true None
value returned by the method fails validation because it's cast to a string, and of course "None"
isn't a valid WKT string, but somewhat slyly the returned value is also None
because that's the default.
In the second test, an empty python list fails validation -- more expected here -- though again cast to a string before validation testing.
In both cases, the final result is good: we have a None
value in the record and passed value was logged. What I think could be improved would be if a None
value is passed, it should skip even trying to validate and just return None
; this might be true for records where there isn't a value and this method would naturally return None
.
For the list... I wonder if the logic in the decorator should only attempt validations for strings? if not a string, fail validation immediately as well?
This kind of only matters to the degree that we want the validation warning to tell us precisely what went wrong. In both cases, the "unable to parse" logged warning was accurate, but only barely. And, if we did want future validators to actually re-throw an exception, it might be handy to have more detail in why things went wrong.
Happy to discuss more, but I think what this drives at is the interplay between the __call__
method, the generic handling of any exception from the actual work of the create_geoshape()
method, and that method also casting anything to a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with Graham's point about more granularity re: None
or str
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions but overall, this is a very good update
Why these changes are being introduced: * This enables Geo-Harvester to process geospatial metadata with missing or potentially invalid geospatial bounding box information. This change is in preparation of the ingestion of geospatial metadata from external sources that will likely vary in data quality. Having these fields as required may be too restrictive. How this addresses that need: * Remove bounding box fields as 'required' from MITAardvark JSON schema * Remove bounding box fields from test fixtures representing 'required' data * Move bounding box attributes to 'optional' group in harvester.records.MITAardvark Side effects of this change: * Required follow-up work: Geo-Harvester will validate bounding box information when present. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-159
Why these changes are being introduced: * This is part of an effort to add some data validation to the TIMDEX pipeline, with an aim to catch errors with records in earlier stages (e.g., during the harvest). This also proposes a workflow wherein validation methods can easily be applied as decorators to field methods for a given SourceRecord. How this addresses that need: * Add 'shapely' as a dependency * Create custom warning for invalid field values * Create decorator method for validating geoshape WKT values * Returns None if geoshape WKT value is invalid (cannot be parsed) Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-159
4339baa
to
7b03b8d
Compare
* Create a fixture mocking a SourceRecord with validated fields * Revise ValidateGeoshapeWKT validator to handle different cases: * field method returns invalid string value (shapely cannot parse) * field method returns Nonetype value * field method returns value that is (not None AND not string) * Update test suite as needed
@ehanson8 and @ghukill --Please see the latest commits! Note(s):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one comment/suggestion, but consider it approved from my end!
As noted, I have a feeling there is a refactor in store for GeoHarvester. While it may not touch this directly, it could be an opportunity to also revisit the validations a little bit. I do think getting this merged sooner than later will be more impactful than the suggested tweaks.
# Format Validators | ||
##################### | ||
@define | ||
class MITAardvarkFormatValidator: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting as @jonavellecuerdo and I have discussed, probably some room for optimizing how and when this reads schemas from disk, and loads the validator into memory, but tested locally and more than sufficiently quick and memory concious as-is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghukill I created a JIRA ticket GDT-180 GeoHarvester Optimizations and added a subtask for this effort (GDT-181). Will groom soon!
harvester/records/validators.py
Outdated
invalid_wkt_warning_message: str = ( | ||
"field: {field_name}, shapely was unable to parse string as WKT: '{value}'; " | ||
"setting value to None" | ||
) | ||
invalid_none_warning_message: str = ( | ||
"field: {field_name}, value of type NoneType was provided; " | ||
"returning original value of None" | ||
) | ||
invalid_type_warning_message: str = ( | ||
"field: {field_name}, value of type {type} was provided: {value}; " | ||
"setting value to None" | ||
) | ||
|
||
def __init__(self, field_method: Callable): | ||
update_wrapper(self, field_method) | ||
self.field_method = field_method | ||
self.field_name = field_method.__name__.removeprefix("_") | ||
|
||
def __call__(self, obj: object) -> str | None: | ||
"""Validate string values retrieved by field method.""" | ||
value = self.field_method(obj) | ||
|
||
if isinstance(value, str): | ||
return self.validate(value) | ||
|
||
if value is None: | ||
logger.warning( | ||
FieldValueInvalidWarning( | ||
self.invalid_none_warning_message.format( | ||
field_name=self.field_name, value=value | ||
) | ||
) | ||
) | ||
return None | ||
|
||
logger.warning( | ||
FieldValueInvalidWarning( | ||
self.invalid_type_warning_message.format( | ||
field_name=self.field_name, type=type(value), value=value | ||
) | ||
) | ||
) | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might suggest a bit of simplifying here, namely just a shared, generic logging warning when the value validated is not a valid WKT string. Something like:
field: {field_name}, unable to parse WKT from value: '{value}'; setting value to None"
This could apply to all scenarios:
- bad string (shapely fails)
None
value (potentially pretty common)- non-string value passed.
For a couple of reasons:
- I think keeping the implementation detail of the
shapely
library out of the logging makes sense at the logging level; if we ever changed the validation logic or mechanism, we can continue to just say "it's not valid WKT" - Not sure we need that granularity in the logging output, if they all return
None
, and this would reduce code complexity here
Not a required change, please pushback if you'd like to keep it! Can also earmark for future optimization work that might occur around validations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I added shapely
when---after our sync--you mentioned that someone reviewing the logs might wonder "who/what" is unable to parse shapely
. 🤔
That said, we will be the folks reviewing these logs, right? I do think it would be easy for us to remember what package is used by reviewing the code, so simplifying the log message wouldn't hurt.
@ghukill and @ehanson8 --would you both be alright if I updated the code to instead read:
def __call__(self, obj: object) -> str | None:
"""Validate string values retrieved by field method."""
value = self.field_method(obj)
if isinstance(value, str):
return self.validate(value)
return None
and update the logged warning to simply say:
"field: {field_name}, unable to parse WKT: {<NoneType> | "<invalid-string-value>" |<invalid-nonstring-value>}, returning None"
```
* "returning None" = value is reset to None OR original None value returned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes @jonavellecuerdo, looks good to me. Apologies if I changed my thinking since our chat, or hadn't fully considered all the scenarios. I'm recalling precisely what you mentioned now... that as a log reader I might wonder, "Why isn't this valid? Looks good to me..." if I didn't know we ran it through shapely
first.
I'm hopeful this decorator validation pattern is something we can apply to other fields in this project, and even other projects. I think it's a fine line, and probably exceptions at every turn, but I think we should try and keep the implementation details (like, the lib shapely
was used) out of the warning messages when something is not valid.
We could optionally write another DEBUG
line that gives more detail if we wanted?
But I think this looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more: I guess it comes down to I think the code complexity of all those logging routes don't outweigh the information loss in the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ghukill Your last comment makes a really good point! No apologies needed--lots of good learning along the way. 🤓
Please see the latest commit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your willingness to go back and forth a bit @jonavellecuerdo. I think that last comment was the salient one; took awhile to get there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Purpose and background context
This PR removes bounding box information as a requirement for MITAardvark records and adds a validator for well-known text (WKT) values for the
dcat_bbox
(and consequently,locn_geometry
) geoshape fields. For review purposes, it might be good to view the work as two parts.Part 1: Removing bounding box information as a requirement for MITAardvark records
The bulk of this work is done in the first commit. Some minor updates to support this were also made in the third commit.
Part 2: Adding WKT validator
The
ValidateGeoshapeWKT
class is an example of how we can implement universal validation methods (i.e., usable for any field methods independent ofSourceRecord
type) moving forward. The proposed framework is to create decorator classes representing validation methods and decoratingSourceRecord
field methods as-needed.We were interested if we could add any validation earlier on in the pipeline to address the most common errors we get from indexing TIMDEX records into OpenSearch.
ValidateGeoshapeWKT: After initial testing and discussion (with @ghukill ), it was decided that we would limit the validation to "Is this value a valid WKT string?" There was some code earlier on that took another step forward to check if the "geometry" object (as
shapely
refers to it) is valid--using the shapely.is_valid and shapely.is_valid_reason methods.Of the two most common errors described above, it would've only addressed (a). That said, we felt we would need to have a deeper understanding of
shapely
before using it. That is when we decided to take a step back and keep the validation simple.How can a reviewer manually see the effects of these changes?
geo-harvester/tests/test_records/test_validator.py
.make test
and confirm all tests are passingIncludes new or updated dependencies?
YES
Changes expectations for external applications?
NO
What are the relevant tickets?
https://mitlibraries.atlassian.net/browse/GDT-159
Developer
Code Reviewer(s)