-
Notifications
You must be signed in to change notification settings - Fork 5
Prevent XPathEvalError
s when Codelist Mapping XPath identifies non-attribute
#229
Conversation
… checks To deal with text() as a valid way of locating code values, there need to be a number of if-else statements added. It is less confusing and easier to read by having a single if-else at the top of the function, rather than lots in the middle.
This is a fix for #224 The Codelist README states the following in relation to mapping files: It's structured as a list of mapping elements, which each have a path element that describes the relevant attribute This indicates that it's incorrect that element text should match a Codelist. IATI, however, is special. This means that at the moment this is deemed something the Standard does. Until the restriction is removed, this fix is required. Yay, IATIland...
I did a silly and forgot to lint
iati/validator.py
Outdated
|
||
""" | ||
parent_el_xpath = '/'.join(split_xpath[:-1]) | ||
last_xpath_section = split_xpath[-1:][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is functionally equivalent to:
last_xpath_section = split_xpath[-1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I.e. the code that’s there makes a list of all items from the last one onwards, then takes the first item of that list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it is!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iati/validator.py
Outdated
@@ -290,32 +356,29 @@ def _check_codes(dataset, codelist): | |||
base_xpath = mapping['xpath'] | |||
condition = mapping['condition'] | |||
split_xpath = base_xpath.split('/') | |||
parent_el_xpath = '/'.join(split_xpath[:-1]) | |||
attr_name = split_xpath[-1:][0][1:] | |||
last_xpath_section = split_xpath[-1:][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as https://github.com/IATI/pyIATI/pull/229/files#r149662369 goes for this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while not parent_el_xpath.startswith('//'): | ||
parent_el_xpath = '/' + parent_el_xpath | ||
if parent_el_xpath.startswith('//['): | ||
parent_el_xpath = '//*[' + parent_el_xpath[3:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you’ll hit a merge conflict with #226 here.
|
||
located_codes = list() | ||
for parent in parents_to_check: | ||
located_codes.append((parent.text, parent.sourceline)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this should be:
located_codes.append((parent.text.strip(), parent.sourceline))
Then the valid test case could be something like:
[…]
<crs-add>
<channel-code>
21039
</channel-code>
</crs-add>
[…]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matching of Codelists in attributes is exact, with whitespace being deemed incorrect. As such, stripping whitespace for element text doesn't seem like a correct interpretation. Maybe a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I agree for attributes… I’d be tempted to be more forgiving for element text. As a quite hand-wavy justification for this, IATI-Rulesets allows for whitespace around (for instance) an iati-identifier
:
- https://github.com/IATI/IATI-Rulesets/blob/c33d11885e1084bb99d897c9a5dc3bfcea91c0fb/iatirulesets/__init__.py#L63
- https://github.com/IATI/IATI-Rulesets/blob/c33d11885e1084bb99d897c9a5dc3bfcea91c0fb/rulesets/standard.json#L19-L20
…so the following passes (despite the whitespace):
<iati-identifier>
some-valid-identifier
</iati-identifier>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a quite hand-wavy justification, IATI-Rulesets allows for whitespace around an iati-identifier
In that same implementation, startswith
doesn't permit whitespace leading whitespace.
The source to go back to is the XPath spec which shows that text()
returns the entire text node. The XPath in the mapping file says that the comparison is against this value (rather than a trimmed or stripped version). As such, checking against a modified text node is not correct.
Further, the XML Spec shows that white space is significant and must be treated as part of a value - treating it as not-present is incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that same implementation, startswith doesn't permit whitespace leading whitespace.
Well, but that’s not used in the standard ruleset, so I’m not sure that it’s relevant… But yes – I take your point and I think you’re quite right!
Just for reference: It looks like slovakaid are the only publisher currently using the element anyway, and they get it right. Also, very few publishers seem to make the mistake of leading/trailing whitespace in iati-identifier/text()
s (or other text nodes for that matter).
Rather than updating the schema, couldn’t you just update the IATI-Codelists README? Put another way: It seems we’re treating text nodes exactly like attributes… So can you explain why it’s a problem to use text nodes for codelist items? |
It's an unnecessary special case that complicates data use. The purpose of a Codelist is to restrict the values to a list that can be used between different systems. A Code has multiple components - a The only notable (for this use case) difference between the permitted content of elements and attributes is the ability to maintain newlines. There are a grand total of zero cases within the IATI Standard where it could possibly be necessary for a Code's The one potentially arguable case is Of all these cases within the Standard itself, there are zero reasons to use an element's text node over an attribute. All it does is complicate data use by having a single case that breaks all other conventions (IATI has quite a few of these). External to the Standard itself, the situation where this might possibly be relevant is that you wish to validate Basically, The Standard has zero need to use an element's text node over an attribute. It could, however, be beneficial in a wider content-validation sense, though this doesn't appear to be permitted within the current specification of components and should be discussed separately. |
Yeah, agreed – that makes sense to me. I take from this that while there’s nothing technically wrong here, it’s needlessly inconsistent, which is bad. That is: if a standard were to use text nodes for codelist items instead of attributes – but did it consistently – that would be a reasonable design choice. But mixing the two is bad. Is that fair? |
iati/validator.py
Outdated
else: | ||
error = ValidationError('warn-code-not-on-codelist', locals()) | ||
el_name = split_xpath[-2:-1][0] # used via `locals()` # pylint: disable=unused-variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It takes me a while to get my head around this notation every time! I’d write this as:
el_name = split_xpath[-2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More radical suggestion: I wonder if this whole bit might be more legible if you scrapped split_xpath
, used rsplit('/', 1)
all over the place, and passed strings around instead of lists of strings. So like:
parent_el_xpath, last_xpath_section = base_xpath.rsplit('/', 1)
Then this bit would be:
_, el_name = parent_el_xpath.rsplit('/', 1) # used via `locals()` # pylint: disable=unused-variable
Hmm on second thought that is still a bit ugly. Yes – feel free to ignore this suggestion! This pull request looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I clearly haven't properly got my head around negative array indexing >.< - super useful feature that few languages I've used much have)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the rsplit
version, it would require _extract_codes_from_attrib()
and _extract_codes_from_element_text()
to have different argument lists.
If these functions were public, that wouldn't be a great thing (lack of consistency). Since they're private, this isn't ideal, though is less of a problem.
It would, however, make the code a bit more DRY since last_xpath_section
and parent_el_xpath
would each only need calculating once - in _check_codes()
. Hum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
¯\(ツ)/¯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...actually, the rsplit
version is cleaner and the parameter thing doesn't matter - it's fine that _extract_codes_from_attrib()
takes an extra parameter since it makes logical sense that you need to specify the attribute to extract data from. With _extract_codes_from_element_text()
you only need the element (or some way of finding it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm! Well... Different args for those functions seems intuitively correct, though... One needs to know am attribute reference; the other doesn't.
I will stop peddling my reckons on this now, because I'm unsure so I trust whatever you decide!
Seems like a good assessment of the situation, yep. |
As suggested in: #229 (review) Passing strings (rather than lists of strings) around is clearer. It also means that a couple of variables only need defining once. It is fine that the attrib and text extraction functions take different numbers of parameters since the former clearly needs more information to undertake its task than the latter.
Before this point, there were too many locals. This removes a couple of unnecessary ones, while also increasing clarity.
iati/validator.py
Outdated
|
||
for parent in parents_to_check: | ||
code = parent.attrib[attr_name] | ||
if last_xpath_section.startswith('@'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a function called _extract_codes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iati/validator.py
Outdated
|
||
return located_codes | ||
|
||
|
||
def _check_codes(dataset, codelist): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This badly needs refactoring. Function neither DRY nor SRP.
iati/validator.py
Outdated
if codelist.complete: | ||
error = ValidationError('err-code-not-on-codelist', locals()) | ||
if last_xpath_section.startswith('@'): | ||
if codelist.complete: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract setting of error name to a variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This separates responsibilities. It becomes necessary to define 'attr_name' twice, though this seems like a reasonable trade-off
A common prefix can be extracted. Therefore it has been extracted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smells like 🌹 s to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again xD
Fixes #224
It is generally assumed that a Codelist shall be used to check a value in an attribute. This is even stated in the IATI-Codelists README.
Unfortunately, not everything is straightforward in IATIland.
This PR adds support for the current (incorrect) implementation of
2.02
where//iati-activity/crs-add/channel-code/text()
is an XPath to validate against a Codelist. This may generalise, though should be deemed highly unstable and a likely candidate for removal since it is extremely questionable whether this should be permitted.Something that should be more stable is that a
ValueError
is raised when an unsupported XPath is provided as part of a mapping.IATI/IATI-Codelists#109 is also integrated.